public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention
@ 2025-07-21  6:10 Pan Deng
  2025-07-21  6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
                   ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Pan Deng @ 2025-07-21  6:10 UTC (permalink / raw)
  To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng

When running a multi-instance FFmpeg workload in a cloud environment,
cache line contention is severe during accesses to the root_domain data
structures, which significantly degrades performance.

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS (frames per second) is used as the score.

Profiling shows the kernel consumes ~20% of CPU cycles, which is
excessive in this scenario. The overhead primarily comes from RT task
scheduling functions like `cpupri_set`, `cpupri_find_fitness`,
`dequeue_pushable_task`, `enqueue_pushable_task`, `pull_rt_task`,
`__find_first_and_bit`, and `__bitmap_and`. This is due to read/write
contention on root_domain cache lines.

The `perf c2c` report, sorted by contention severity, reveals:

root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` is heavily loaded/stored,
   since counts[0] is updated more frequently than the others whenever an
   RT task is enqueued on an empty runqueue or dequeued from a
   non-overloaded runqueue.
- `rto_mask` is heavily loaded
- `rto_loop_next` and `rto_loop_start` are frequently stored
- `rto_push_work` and `rto_lock` are lightly accessed
- cycles per load: ~10K to 59K.

root_domain cache line 1:
- `rto_count` is frequently loaded/stored
- `overloaded` is heavily loaded
- cycles per load: ~2.8K to 44K

cpumask (bitmap) cache line of cpupri_vec->mask:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K

The last cache line of cpupri:
- `cpupri_vec->count` and `mask` contend. The transcoding threads use
  RT priority 99, so the contention occurs at the end of the structure.
- cycles per load: ~1.5K to 10.5K

Based on the above, we propose 4 patches to mitigate the contention;
each patch resolves part of these issues:
Patch 1: Reorganize `cpupri_vec`, separating the `count` and `mask`
         fields to reduce contention on root_domain cache line 3 and
         cpupri's last cache line. This patch has an alternative
         implementation, described in the patch commit message; comments
         are welcome.
Patch 2: Restructure `root_domain` to minimize contention on
         root_domain cache lines 1 and 3 by reordering fields.
Patch 3: Split `root_domain->rto_count` to per-NUMA-node counters,
         reducing the contention on root_domain cache line 1.
Patch 4: Split `cpupri_vec->cpumask` to per-NUMA-node bitmaps, reducing
         load/store contention on the cpumask bitmap cache line.

Evaluation:
The patches are tested non-cumulatively; I'm happy to provide additional
data as needed.

FFmpeg benchmark:
Performance changes (FPS):
- Baseline:             100.0%
- Baseline + Patch 1:   111.0%
- Baseline + Patch 2:   105.0%
- Baseline + Patch 3:   104.0%
- Baseline + Patch 4:   103.8%

Kernel CPU cycle usage (lower is better):
- Baseline:              20.0%
- Baseline + Patch 1:    11.0%
- Baseline + Patch 2:    17.7%
- Baseline + Patch 3:    18.6%
- Baseline + Patch 4:    18.7%

Cycles per load reduction (by perf c2c report):
- Patch 1:
  - `root_domain` cache line 3:    10K–59K    ->  0.5K–8K
  - `cpupri` last cache line:      1.5K–10.5K ->  eliminated
- Patch 2:
  - `root_domain` cache line 1:    2.8K–44K   ->  2.1K–2.7K
  - `root_domain` cache line 3:    10K–59K    ->  eliminated
- Patch 3:
  - `root_domain` cache line 1:    2.8K–44K   ->  eliminated
- Patch 4:
  - `cpupri_vec->mask` cache line: 2.2K–8.7K  ->  0.5K–2.2K

stress-ng rt cyclic benchmark:
Command:
stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo   \
                    --timeout 30 --minimize --metrics

Performance changes (bogo ops/s, real time):
- Baseline:             100.0%
- Baseline + Patch 1:   131.4%
- Baseline + Patch 2:   118.6%
- Baseline + Patch 3:   150.4%
- Baseline + Patch 4:   105.9%

rt-tests pi_stress benchmark:
Command:
rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))

Performance changes (Total inversions performed):
- Baseline:             100.0%
- Baseline + Patch 1:   176.5%
- Baseline + Patch 2:   104.7%
- Baseline + Patch 3:   105.1%
- Baseline + Patch 4:   109.3%

Changes since v1:
 - Patch 3: Fixed the !CONFIG_SMP build issue.
 - Patch 1-4: Added stress-ng/cyclic and rt-tests/pi_stress test results.

Comments are appreciated. I'm looking forward to your feedback and to
making revisions, thanks a lot!

Pan Deng (4):
  sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  sched/rt: Restructure root_domain to reduce cacheline contention
  sched/rt: Split root_domain->rto_count to per-NUMA-node counters
  sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce
    contention

 kernel/sched/cpupri.c   | 200 ++++++++++++++++++++++++++++++++++++----
 kernel/sched/cpupri.h   |   6 +-
 kernel/sched/rt.c       |  56 ++++++++++-
 kernel/sched/sched.h    |  61 ++++++------
 kernel/sched/topology.c |   7 ++
 5 files changed, 282 insertions(+), 48 deletions(-)

--
2.43.5


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
@ 2025-07-21  6:10 ` Pan Deng
  2026-03-20 10:09   ` Peter Zijlstra
  2026-04-08 10:16   ` Chen, Yu C
  2025-07-21  6:10 ` [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 41+ messages in thread
From: Pan Deng @ 2025-07-21  6:10 UTC (permalink / raw)
  To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng

When running a multi-instance FFmpeg workload on an HCC system, significant
cache line contention is observed around `cpupri_vec->count` and `mask` in
struct root_domain.

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.

perf c2c tool reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
   and contends with other fields, since counts[0] is updated more
   frequently than the others whenever an RT task is enqueued on an empty
   runqueue or dequeued from a non-overloaded runqueue.
- cycles per load: ~10K to 59K

cpupri's last cache line:
- `cpupri_vec->count` and `mask` contend. The transcoding threads use
  RT priority 99, so the contention occurs at the end of the structure.
- cycles per load: ~1.5K to 10.5K

This change mitigates the `cpupri_vec->count` and `mask` related
contention by separating each count and mask into different cache lines.

As a result:
- FPS improves by ~11%
- Kernel cycles% drops from ~20% to ~11%
- `count` and `mask` related cache line contention is mitigated: perf c2c
  shows that root_domain cache line 3 `cycles per load` drops from
  ~10K-59K to ~0.5K-8K, and cpupri's last cache line no longer appears in
  the report.
- stress-ng cyclic benchmark is improved by ~31.4%, command:
  stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo   \
                      --timeout 30 --minimize --metrics
- rt-tests/pi_stress is improved by ~76.5%, command:
  rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))

Appendix:
1. Current layout of contended data structure:
struct root_domain {
    ...
    struct irq_work            rto_push_work;        /*   120    32 */
    /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
    raw_spinlock_t             rto_lock;             /*   152     4 */
    int                        rto_loop;             /*   156     4 */
    int                        rto_cpu;              /*   160     4 */
    atomic_t                   rto_loop_next;        /*   164     4 */
    atomic_t                   rto_loop_start;       /*   168     4 */
    /* XXX 4 bytes hole, try to pack */
    cpumask_var_t              rto_mask;             /*   176     8 */
    /* --- cacheline 3 boundary (192 bytes) was 8 bytes hence --- */
    struct cpupri              cpupri;               /*   184  1624 */
    /* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
    struct perf_domain *       pd;                   /*  1808     8 */
    /* size: 1816, cachelines: 29, members: 21 */
    /* sum members: 1802, holes: 3, sum holes: 14 */
    /* forced alignments: 1 */
    /* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

2. Perf c2c report of root_domain cache line 3:
-------  -------  ------  ------  ------  ------  ------------------------
 Rmt      Lcl     Store   Data    Load    Total    Symbol
Hitm%    Hitm%   L1 Hit%  offset  cycles  records
-------  -------  ------  ------  ------  ------  ------------------------
 353       44       62    0xff14d42c400e3880
-------  -------  ------  ------  ------  ------  ------------------------
 0.00%    2.27%    0.00%  0x0     21683   6     __flush_smp_call_function_
 0.00%    2.27%    0.00%  0x0     22294   5     __flush_smp_call_function_
 0.28%    0.00%    0.00%  0x0     0       2     irq_work_queue_on
 0.28%    0.00%    0.00%  0x0     27824   4     irq_work_single
 0.00%    0.00%    1.61%  0x0     28151   6     irq_work_queue_on
 0.57%    0.00%    0.00%  0x18    21822   8     native_queued_spin_lock_sl
 0.28%    2.27%    0.00%  0x18    16101   10    native_queued_spin_lock_sl
 0.57%    0.00%    0.00%  0x18    33199   5     native_queued_spin_lock_sl
 0.00%    0.00%    1.61%  0x18    10908   32    _raw_spin_lock
 0.00%    0.00%    1.61%  0x18    59770   2     _raw_spin_lock
 0.00%    0.00%    1.61%  0x18    0       1     _raw_spin_unlock
 1.42%    0.00%    0.00%  0x20    12918   20    pull_rt_task
 0.85%    0.00%   25.81%  0x24    31123   199   pull_rt_task
 0.85%    0.00%    3.23%  0x24    38218   24    pull_rt_task
 0.57%    4.55%   19.35%  0x28    30558   207   pull_rt_task
 0.28%    0.00%    0.00%  0x28    55504   10    pull_rt_task
18.70%   18.18%    0.00%  0x30    26438   291   dequeue_pushable_task
17.28%   22.73%    0.00%  0x30    29347   281   enqueue_pushable_task
 1.70%    2.27%    0.00%  0x30    12819   31    enqueue_pushable_task
 0.28%    0.00%    0.00%  0x30    17726   18    dequeue_pushable_task
34.56%   29.55%    0.00%  0x38    25509   527   cpupri_find_fitness
13.88%   11.36%   24.19%  0x38    30654   342   cpupri_set
 3.12%    2.27%    0.00%  0x38    18093   39    cpupri_set
 1.70%    0.00%    0.00%  0x38    37661   52    cpupri_find_fitness
 1.42%    2.27%   19.35%  0x38    31110   211   cpupri_set
 1.42%    0.00%    1.61%  0x38    45035   31    cpupri_set

3. Perf c2c report of cpupri's last cache line
-------  -------  ------  ------  ------  ------  ------------------------
 Rmt      Lcl     Store   Data    Load    Total    Symbol
Hitm%    Hitm%   L1 Hit%  offset  cycles  records
-------  -------  ------  ------  ------  ------  ------------------------
 149       43       41    0xff14d42c400e3ec0
-------  -------  ------  ------  ------  ------  ------------------------
 8.72%   11.63%    0.00%  0x8     2001    165   cpupri_find_fitness
 1.34%    2.33%    0.00%  0x18    1456    151   cpupri_find_fitness
 8.72%    9.30%   58.54%  0x28    1744    263   cpupri_set
 2.01%    4.65%   41.46%  0x28    1958    301   cpupri_set
 1.34%    0.00%    0.00%  0x28    10580   6     cpupri_set
69.80%   67.44%    0.00%  0x30    1754    347   cpupri_set
 8.05%    4.65%    0.00%  0x30    2144    256   cpupri_set

Signed-off-by: Pan Deng <pan.deng@intel.com>
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Note: The side effect of this change is that struct cpupri size is
increased from 26 cache lines to 203 cache lines.

An alternative implementation of this patch could separate the `count`
and `mask` members of cpupri_vec into two parallel arrays (counts[] and
masks[]) and add two paddings:
1. Between counts[0] and counts[1], since counts[0] is updated more
   frequently than the others.
2. Between the two arrays, since counts[] is read-write while masks[],
   which only stores pointers, is read-mostly.

The alternative introduces the complexity of 31+/21- LoC changes. It
achieves almost the same performance, and at the same time struct cpupri
size is reduced from 26 cache lines to 21 cache lines.
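
For reference, a rough sketch of what that alternative layout could look
like (the field and padding names below are illustrative assumptions, not
the actual code of that variant):

struct cpupri {
	/* counts[0] is updated most often; keep it on its own cache line */
	atomic_t	count0;
	char		__pad[L1_CACHE_BYTES - sizeof(atomic_t)];
	atomic_t	counts[CPUPRI_NR_PRIORITIES - 1];

	/*
	 * counts[] is read-write while masks[] is read-mostly (it only
	 * stores pointers), so start masks[] on a new cache line.
	 */
	cpumask_var_t	masks[CPUPRI_NR_PRIORITIES]	____cacheline_aligned;

	int		*cpu_to_pri;
};
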
---
 kernel/sched/cpupri.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
 
 struct cpupri_vec {
 	atomic_t		count;
-	cpumask_var_t		mask;
+	cpumask_var_t		mask	____cacheline_aligned;
 };
 
 struct cpupri {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention
  2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
  2025-07-21  6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
@ 2025-07-21  6:10 ` Pan Deng
  2026-03-20 10:18   ` Peter Zijlstra
  2025-07-21  6:10 ` [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Pan Deng @ 2025-07-21  6:10 UTC (permalink / raw)
  To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng

When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on root_domain cache lines 1 and 3.

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.

perf c2c tool reveals (sorted by contention severity):
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored,
   since counts[0] is updated more frequently than the others whenever an
   RT task is enqueued on an empty runqueue or dequeued from a
   non-overloaded runqueue.
- `rto_mask` (0x30) is heavily loaded
- `rto_loop_next` (0x24) and `rto_loop_start` (0x28) are frequently stored
- `rto_push_work` (0x0) and `rto_lock` (0x18) are lightly accessed
- cycles per load: ~10K to 59K

root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K

This change adjusts the layout of `root_domain` to isolate these contended
fields across separate cache lines:
1. `rto_count` remains in the 1st cache line; `overloaded` and
   `overutilized` are moved to the last cache line
2. `rto_push_work` is placed in the 2nd cache line
3. `rto_loop_start`, `rto_loop_next`, and `rto_lock` remain in the 3rd
   cache line; `rto_mask` is moved near `pd` in the penultimate cache line
4. `cpupri` starts at the 4th cache line to prevent `pri_to_cpu[0].count`
   from contending with fields in cache line 3.
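
In condensed form, the resulting field order is roughly the following
(summarized from the diff below, comments and offsets omitted):

struct root_domain {
	atomic_t		refcount;
	atomic_t		rto_count;
	struct rcu_head		rcu;
	cpumask_var_t		span;
	cpumask_var_t		online;
	atomic_t		dlo_count;
	struct dl_bw		dl_bw;
	struct cpudl		cpudl;
#ifdef HAVE_RT_PUSH_IPI
	struct irq_work		rto_push_work;
	raw_spinlock_t		rto_lock;
	int			rto_loop;
	int			rto_cpu;
	atomic_t		rto_loop_next;
	atomic_t		rto_loop_start;
#endif
	cpumask_var_t		dlo_mask;
	u64			visit_cookie;
	struct cpupri		cpupri		____cacheline_aligned;
	struct perf_domain __rcu *pd		____cacheline_aligned;
	cpumask_var_t		rto_mask;
	bool			overloaded	____cacheline_aligned;
	bool			overutilized;
};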

With this change:
- FPS improves by ~5%
- Kernel cycles% drops from ~20% to ~17.7%
- root_domain cache line 3 no longer appears in the perf-c2c report
- cycles per load of root_domain cache line 1 is reduced from ~2.8K-44K
  to ~2.1K-2.7K
- stress-ng cyclic benchmark is improved by ~18.6%, command:
  stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo   \
                      --timeout 30 --minimize --metrics
- rt-tests/pi_stress is improved by ~4.7%, command:
  rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))

Given the nature of the change, to my understanding it doesn't introduce
any negative impact in other scenarios.

Note: This change increases the size of `root_domain` from 29 to 31 cache
lines. This is considered acceptable since a system only has a few
root_domain objects (typically just one).

Appendix:
1. Current layout of contended data structure:
struct root_domain {
    atomic_t                   refcount;             /*     0     4 */
    atomic_t                   rto_count;            /*     4     4 */
    struct callback_head rcu __attribute__((__aligned__(8)));/*8 16 */
    cpumask_var_t              span;                 /*    24     8 */
    cpumask_var_t              online;               /*    32     8 */
    bool                       overloaded;           /*    40     1 */
    bool                       overutilized;         /*    41     1 */
    /* XXX 6 bytes hole, try to pack */
    cpumask_var_t              dlo_mask;             /*    48     8 */
    atomic_t                   dlo_count;            /*    56     4 */
    /* XXX 4 bytes hole, try to pack */

    /* --- cacheline 1 boundary (64 bytes) --- */
    struct dl_bw               dl_bw;                /*    64    24 */
    struct cpudl               cpudl;                /*    88    24 */
    u64                        visit_gen;            /*   112     8 */
    struct irq_work            rto_push_work;        /*   120    32 */

    /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
    raw_spinlock_t             rto_lock;             /*   152     4 */
    int                        rto_loop;             /*   156     4 */
    int                        rto_cpu;              /*   160     4 */
    atomic_t                   rto_loop_next;        /*   164     4 */
    atomic_t                   rto_loop_start;       /*   168     4 */
    /* XXX 4 bytes hole, try to pack */
    cpumask_var_t              rto_mask;             /*   176     8 */

    /* --- cacheline 3 boundary (192 bytes) was 8 bytes hence --- */
    struct cpupri              cpupri;               /*   184  1624 */
    ...
} __attribute__((__aligned__(8)));

2. Perf c2c report of root_domain cache line 3:
-------  -------  ------  ------  ------  ------  ------------------------
 Rmt      Lcl     Store   Data    Load    Total    Symbol
Hitm%    Hitm%   L1 Hit%  offset  cycles  records
-------  -------  ------  ------  ------  ------  ------------------------
 353       44       62    0xff14d42c400e3880
-------  -------  ------  ------  ------  ------  ------------------------
 0.00%    2.27%    0.00%  0x0     21683   6     __flush_smp_call_function_
 0.00%    2.27%    0.00%  0x0     22294   5     __flush_smp_call_function_
 0.28%    0.00%    0.00%  0x0     0       2     irq_work_queue_on
 0.28%    0.00%    0.00%  0x0     27824   4     irq_work_single
 0.00%    0.00%    1.61%  0x0     28151   6     irq_work_queue_on
 0.57%    0.00%    0.00%  0x18    21822   8     native_queued_spin_lock_sl
 0.28%    2.27%    0.00%  0x18    16101   10    native_queued_spin_lock_sl
 0.57%    0.00%    0.00%  0x18    33199   5     native_queued_spin_lock_sl
 0.00%    0.00%    1.61%  0x18    10908   32    _raw_spin_lock
 0.00%    0.00%    1.61%  0x18    59770   2     _raw_spin_lock
 0.00%    0.00%    1.61%  0x18    0       1     _raw_spin_unlock
 1.42%    0.00%    0.00%  0x20    12918   20    pull_rt_task
 0.85%    0.00%   25.81%  0x24    31123   199   pull_rt_task
 0.85%    0.00%    3.23%  0x24    38218   24    pull_rt_task
 0.57%    4.55%   19.35%  0x28    30558   207   pull_rt_task
 0.28%    0.00%    0.00%  0x28    55504   10    pull_rt_task
18.70%   18.18%    0.00%  0x30    26438   291   dequeue_pushable_task
17.28%   22.73%    0.00%  0x30    29347   281   enqueue_pushable_task
 1.70%    2.27%    0.00%  0x30    12819   31    enqueue_pushable_task
 0.28%    0.00%    0.00%  0x30    17726   18    dequeue_pushable_task
34.56%   29.55%    0.00%  0x38    25509   527   cpupri_find_fitness
13.88%   11.36%   24.19%  0x38    30654   342   cpupri_set
 3.12%    2.27%    0.00%  0x38    18093   39    cpupri_set
 1.70%    0.00%    0.00%  0x38    37661   52    cpupri_find_fitness
 1.42%    2.27%   19.35%  0x38    31110   211   cpupri_set
 1.42%    0.00%    1.61%  0x38    45035   31    cpupri_set

3. Perf c2c report of root_domain cache line 1:
-------  -------  ------  ------  ------  ------  ------------------------
 Rmt      Lcl     Store   Data    Load    Total    Symbol
Hitm%    Hitm%   L1 Hit%  offset  cycles  records
-------  -------  ------  ------  ------  ------  ------------------------
 231       43       48    0xff14d42c400e3800
-------  -------  ------  ------  ------  ------  ------------------------
22.51%   18.60%    0.00%  0x4     5041    247   pull_rt_task
 5.63%    2.33%   45.83%  0x4     6995    315   dequeue_pushable_task
 3.90%    4.65%   54.17%  0x4     6587    370   enqueue_pushable_task
 0.43%    0.00%    0.00%  0x4     17111   4     enqueue_pushable_task
 0.43%    0.00%    0.00%  0x4     44062   4     dequeue_pushable_task
32.03%   27.91%    0.00%  0x28    6393    285   enqueue_task_rt
16.45%   27.91%    0.00%  0x28    5534    139   sched_balance_newidle
14.72%   18.60%    0.00%  0x28    5287    110   dequeue_task_rt
 3.46%    0.00%    0.00%  0x28    2820    25    enqueue_task_fair
 0.43%    0.00%    0.00%  0x28    220     3     enqueue_task_stop

Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/sched.h | 52 +++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 83e3aa917142..bc67806911f2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -968,24 +968,29 @@ struct root_domain {
 	cpumask_var_t		span;
 	cpumask_var_t		online;
 
+	atomic_t		dlo_count;
+	struct dl_bw		dl_bw;
+	struct cpudl		cpudl;
+
+#ifdef HAVE_RT_PUSH_IPI
 	/*
-	 * Indicate pullable load on at least one CPU, e.g:
-	 * - More than one runnable task
-	 * - Running task is misfit
+	 * For IPI pull requests, loop across the rto_mask.
 	 */
-	bool			overloaded;
-
-	/* Indicate one or more CPUs over-utilized (tipping point) */
-	bool			overutilized;
+	struct irq_work		rto_push_work;
+	raw_spinlock_t		rto_lock;
+	/* These are only updated and read within rto_lock */
+	int			rto_loop;
+	int			rto_cpu;
+	/* These atomics are updated outside of a lock */
+	atomic_t		rto_loop_next;
+	atomic_t		rto_loop_start;
+#endif
 
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).
 	 */
 	cpumask_var_t		dlo_mask;
-	atomic_t		dlo_count;
-	struct dl_bw		dl_bw;
-	struct cpudl		cpudl;
 
 	/*
 	 * Indicate whether a root_domain's dl_bw has been checked or
@@ -995,32 +1000,29 @@ struct root_domain {
 	 * that u64 is 'big enough'. So that shouldn't be a concern.
 	 */
 	u64 visit_cookie;
+	struct cpupri		cpupri	____cacheline_aligned;
 
-#ifdef HAVE_RT_PUSH_IPI
 	/*
-	 * For IPI pull requests, loop across the rto_mask.
+	 * NULL-terminated list of performance domains intersecting with the
+	 * CPUs of the rd. Protected by RCU.
 	 */
-	struct irq_work		rto_push_work;
-	raw_spinlock_t		rto_lock;
-	/* These are only updated and read within rto_lock */
-	int			rto_loop;
-	int			rto_cpu;
-	/* These atomics are updated outside of a lock */
-	atomic_t		rto_loop_next;
-	atomic_t		rto_loop_start;
-#endif
+	struct perf_domain __rcu *pd	____cacheline_aligned;
+
 	/*
 	 * The "RT overload" flag: it gets set if a CPU has more than
 	 * one runnable RT task.
 	 */
 	cpumask_var_t		rto_mask;
-	struct cpupri		cpupri;
 
 	/*
-	 * NULL-terminated list of performance domains intersecting with the
-	 * CPUs of the rd. Protected by RCU.
+	 * Indicate pullable load on at least one CPU, e.g:
+	 * - More than one runnable task
+	 * - Running task is misfit
 	 */
-	struct perf_domain __rcu *pd;
+	bool			overloaded	____cacheline_aligned;
+
+	/* Indicate one or more CPUs over-utilized (tipping point) */
+	bool			overutilized;
 };
 
 extern void init_defrootdomain(void);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
  2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
  2025-07-21  6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
  2025-07-21  6:10 ` [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
@ 2025-07-21  6:10 ` Pan Deng
  2026-03-20 10:24   ` Peter Zijlstra
  2025-07-21  6:10 ` [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
  2026-03-20  9:59 ` [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Peter Zijlstra
  4 siblings, 1 reply; 41+ messages in thread
From: Pan Deng @ 2025-07-21  6:10 UTC (permalink / raw)
  To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng

When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on the root_domain `rto_count` and
`overloaded` fields.

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.

perf c2c tool reveals:
root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K

A separate patch rearranges root_domain to place `overloaded` on a
different cache line, but this alone is insufficient to resolve the
contention on `rto_count`. As a complement, this patch splits
`rto_count` into per-NUMA-node counters to reduce the contention.

With this change:
- FPS improves by ~4%
- Kernel cycles% drops from ~20% to ~18.6%
- The cache line no longer appears in the perf-c2c report
- stress-ng cyclic benchmark is improved by ~50.4%, command:
  stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo   \
                      --timeout 30 --minimize --metrics
- rt-tests/pi_stress is improved by ~5.1%, command:
  rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))

Appendix:
1. Perf c2c report of root_domain cache line 1:
-------  -------  ------  ------  ------  ------  ------------------------
 Rmt      Lcl     Store   Data    Load    Total    Symbol
Hitm%    Hitm%   L1 Hit%  offset  cycles  records
-------  -------  ------  ------  ------  ------  ------------------------
 231       43       48    0xff14d42c400e3800
-------  -------  ------  ------  ------  ------  ------------------------
22.51%   18.60%    0.00%  0x4     5041    247   pull_rt_task
 5.63%    2.33%   45.83%  0x4     6995    315   dequeue_pushable_task
 3.90%    4.65%   54.17%  0x4     6587    370   enqueue_pushable_task
 0.43%    0.00%    0.00%  0x4     17111   4     enqueue_pushable_task
 0.43%    0.00%    0.00%  0x4     44062   4     dequeue_pushable_task
32.03%   27.91%    0.00%  0x28    6393    285   enqueue_task_rt
16.45%   27.91%    0.00%  0x28    5534    139   sched_balance_newidle
14.72%   18.60%    0.00%  0x28    5287    110   dequeue_task_rt
 3.46%    0.00%    0.00%  0x28    2820    25    enqueue_task_fair
 0.43%    0.00%    0.00%  0x28    220     3     enqueue_task_stop

Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
V1 -> V2: Fixed the !CONFIG_SMP build issue
---
 kernel/sched/rt.c       | 56 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h    |  9 ++++++-
 kernel/sched/topology.c |  7 ++++++
 3 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c37033..cbcfd3aa3439 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -337,9 +337,58 @@ static inline bool need_pull_rt_task(struct rq *rq, struct task_struct *prev)
 	return rq->online && rq->rt.highest_prio.curr > prev->prio;
 }
 
+int rto_counts_init(atomic_tp **rto_counts)
+{
+	int i;
+	atomic_tp *counts = kzalloc(nr_node_ids * sizeof(atomic_tp), GFP_KERNEL);
+
+	if (!counts)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_node_ids; i++) {
+		counts[i] = kzalloc_node(sizeof(atomic_t), GFP_KERNEL, i);
+
+		if (!counts[i])
+			goto cleanup;
+	}
+
+	*rto_counts = counts;
+	return 0;
+
+cleanup:
+	while (i--)
+		kfree(counts[i]);
+
+	kfree(counts);
+	return -ENOMEM;
+}
+
+void rto_counts_cleanup(atomic_tp *rto_counts)
+{
+	for (int i = 0; i < nr_node_ids; i++)
+		kfree(rto_counts[i]);
+
+	kfree(rto_counts);
+}
+
 static inline int rt_overloaded(struct rq *rq)
 {
-	return atomic_read(&rq->rd->rto_count);
+	int count = 0;
+	int cur_node, nid;
+
+	cur_node = numa_node_id();
+
+	for (int i = 0; i < nr_node_ids; i++) {
+		nid = (cur_node + i) % nr_node_ids;
+		count += atomic_read(rq->rd->rto_counts[nid]);
+
+	// Callers only check whether this is 0 or 1,
+	// so return early once the count exceeds 1
+		if (count > 1)
+			return count;
+	}
+
+	return count;
 }
 
 static inline void rt_set_overload(struct rq *rq)
@@ -358,7 +407,7 @@ static inline void rt_set_overload(struct rq *rq)
 	 * Matched by the barrier in pull_rt_task().
 	 */
 	smp_wmb();
-	atomic_inc(&rq->rd->rto_count);
+	atomic_inc(rq->rd->rto_counts[cpu_to_node(rq->cpu)]);
 }
 
 static inline void rt_clear_overload(struct rq *rq)
@@ -367,7 +416,7 @@ static inline void rt_clear_overload(struct rq *rq)
 		return;
 
 	/* the order here really doesn't matter */
-	atomic_dec(&rq->rd->rto_count);
+	atomic_dec(rq->rd->rto_counts[cpu_to_node(rq->cpu)]);
 	cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
 }
 
@@ -443,6 +492,7 @@ static inline void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
 static inline void rt_queue_push_tasks(struct rq *rq)
 {
 }
+
 #endif /* CONFIG_SMP */
 
 static void enqueue_top_rt_rq(struct rt_rq *rt_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc67806911f2..13fc3ac3381b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -953,6 +953,8 @@ struct perf_domain {
 	struct rcu_head rcu;
 };
 
+typedef atomic_t *atomic_tp;
+
 /*
  * We add the notion of a root-domain which will be used to define per-domain
  * variables. Each exclusive cpuset essentially defines an island domain by
@@ -963,12 +965,15 @@ struct perf_domain {
  */
 struct root_domain {
 	atomic_t		refcount;
-	atomic_t		rto_count;
 	struct rcu_head		rcu;
 	cpumask_var_t		span;
 	cpumask_var_t		online;
 
 	atomic_t		dlo_count;
+
+	/* rto_count per node */
+	atomic_tp		*rto_counts;
+
 	struct dl_bw		dl_bw;
 	struct cpudl		cpudl;
 
@@ -1030,6 +1035,8 @@ extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
 extern void sched_get_rd(struct root_domain *rd);
 extern void sched_put_rd(struct root_domain *rd);
+extern int rto_counts_init(atomic_tp **rto_counts);
+extern void rto_counts_cleanup(atomic_tp *rto_counts);
 
 static inline int get_rd_overloaded(struct root_domain *rd)
 {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b958fe48e020..166dc8177a44 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -457,6 +457,7 @@ static void free_rootdomain(struct rcu_head *rcu)
 {
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
 
+	rto_counts_cleanup(rd->rto_counts);
 	cpupri_cleanup(&rd->cpupri);
 	cpudl_cleanup(&rd->cpudl);
 	free_cpumask_var(rd->dlo_mask);
@@ -549,8 +550,14 @@ static int init_rootdomain(struct root_domain *rd)
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_cpudl;
+
+	if (rto_counts_init(&rd->rto_counts) != 0)
+		goto free_cpupri;
+
 	return 0;
 
+free_cpupri:
+	cpupri_cleanup(&rd->cpupri);
 free_cpudl:
 	cpudl_cleanup(&rd->cpudl);
 free_rto_mask:
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
                   ` (2 preceding siblings ...)
  2025-07-21  6:10 ` [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
@ 2025-07-21  6:10 ` Pan Deng
  2026-03-20 12:40   ` Peter Zijlstra
  2026-03-20  9:59 ` [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Peter Zijlstra
  4 siblings, 1 reply; 41+ messages in thread
From: Pan Deng @ 2025-07-21  6:10 UTC (permalink / raw)
  To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng

When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on the bitmap of `cpupri_vec->cpumask`.

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.

perf c2c tool reveals:
cpumask (bitmap) cache line of `cpupri_vec->mask`:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K

This change splits `cpupri_vec->cpumask` into per-NUMA-node cpumasks to
mitigate false sharing: each node's mask tracks only its own node's CPU
bits, while the bits belonging to other nodes are kept set, so the full
mask can still be rebuilt by ANDing the per-node masks together.

As a result:
- FPS improves by ~3.8%
- Kernel cycles% drops from ~20% to ~18.7%
- Cache line contention is mitigated: perf-c2c shows cycles per load
  drops from ~2.2K-8.7K to ~0.5K-2.2K
- stress-ng cyclic benchmark is improved by ~5.9%, command:
  stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo   \
                      --timeout 30 --minimize --metrics
- rt-tests/pi_stress is improved by ~9.3%, command:
  rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))

Note: the CONFIG_CPUMASK_OFFSTACK=n case remains unchanged.

Appendix:
1. Perf c2c report of `cpupri_vec->mask` bitmap cache line:
-------  -------  ------  ------  ------  ------  ------------------------
 Rmt      Lcl     Store   Data    Load    Total    Symbol
Hitm%    Hitm%   L1 Hit%  offset  cycles  records
-------  -------  ------  ------  ------  ------  ------------------------
 155       39       39    0xff14d52c4682d800
-------  -------  ------  ------  ------  ------  ------------------------
43.23%   43.59%    0.00%  0x0     3489    415   _find_first_and_bit
 3.23%    5.13%    0.00%  0x0     3478    107   __bitmap_and
 3.23%    0.00%    0.00%  0x0     2712    33    _find_first_and_bit
 1.94%    0.00%    7.69%  0x0     5992    33    cpupri_set
 0.00%    0.00%    5.13%  0x0     3733    19    cpupri_set
12.90%   12.82%    0.00%  0x8     3452    297   _find_first_and_bit
 1.29%    2.56%    0.00%  0x8     3007    117   __bitmap_and
 0.00%    5.13%    0.00%  0x8     3041    20    _find_first_and_bit
 0.00%    2.56%    2.56%  0x8     2374    22    cpupri_set
 0.00%    0.00%    7.69%  0x8     4194    38    cpupri_set
 8.39%    2.56%    0.00%  0x10    3336    264   _find_first_and_bit
 3.23%    0.00%    0.00%  0x10    3023    46    _find_first_and_bit
 2.58%    0.00%    0.00%  0x10    3040    130   __bitmap_and
 1.29%    0.00%   12.82%  0x10    4075    34    cpupri_set
 0.00%    0.00%    2.56%  0x10    2197    19    cpupri_set
 0.00%    2.56%    7.69%  0x18    4085    27    cpupri_set
 0.00%    2.56%    0.00%  0x18    3128    220   _find_first_and_bit
 0.00%    0.00%    5.13%  0x18    3028    20    cpupri_set
 2.58%    2.56%    0.00%  0x20    3089    198   _find_first_and_bit
 1.29%    0.00%    5.13%  0x20    5114    29    cpupri_set
 0.65%    2.56%    0.00%  0x20    3224    96    __bitmap_and
 0.65%    0.00%    7.69%  0x20    4392    31    cpupri_set
 2.58%    0.00%    0.00%  0x28    3327    214   _find_first_and_bit
 0.65%    2.56%    5.13%  0x28    5252    31    cpupri_set
 0.65%    0.00%    7.69%  0x28    8755    25    cpupri_set
 0.65%    0.00%    0.00%  0x28    4414    14    _find_first_and_bit
 1.29%    2.56%    0.00%  0x30    3139    171   _find_first_and_bit
 0.65%    0.00%    7.69%  0x30    2185    18    cpupri_set
 0.65%    0.00%    0.00%  0x30    3404    108   __bitmap_and
 0.00%    0.00%    2.56%  0x30    5542    21    cpupri_set
 3.23%    5.13%    0.00%  0x38    3493    190   _find_first_and_bit
 3.23%    2.56%    0.00%  0x38    3171    108   __bitmap_and
 0.00%    2.56%    7.69%  0x38    3285    14    cpupri_set
 0.00%    0.00%    5.13%  0x38    4035    27    cpupri_set

Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++++----
 kernel/sched/cpupri.h |   4 +
 2 files changed, 186 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 42c40cfdf836..306b6baff4cd 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -64,6 +64,143 @@ static int convert_prio(int prio)
 	return cpupri;
 }
 
+#ifdef	CONFIG_CPUMASK_OFFSTACK
+static inline int alloc_vec_masks(struct cpupri_vec *vec)
+{
+	int i;
+
+	for (i = 0; i < nr_node_ids; i++) {
+		if (!zalloc_cpumask_var_node(&vec->masks[i], GFP_KERNEL, i))
+			goto cleanup;
+
+		// Clear this node's bits; keep all other nodes' bits set
+		bitmap_complement(cpumask_bits(vec->masks[i]),
+			cpumask_bits(cpumask_of_node(i)), small_cpumask_bits);
+	}
+	return 0;
+
+cleanup:
+	while (i--)
+		free_cpumask_var(vec->masks[i]);
+	return -ENOMEM;
+}
+
+static inline void free_vec_masks(struct cpupri_vec *vec)
+{
+	for (int i = 0; i < nr_node_ids; i++)
+		free_cpumask_var(vec->masks[i]);
+}
+
+static inline int setup_vec_mask_var_ts(struct cpupri *cp)
+{
+	int i;
+
+	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+		struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+		vec->masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
+		if (!vec->masks)
+			goto cleanup;
+	}
+	return 0;
+
+cleanup:
+	/* Free any already allocated masks */
+	while (i--) {
+		kfree(cp->pri_to_cpu[i].masks);
+		cp->pri_to_cpu[i].masks = NULL;
+	}
+
+	return -ENOMEM;
+}
+
+static inline void free_vec_mask_var_ts(struct cpupri *cp)
+{
+	for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+		kfree(cp->pri_to_cpu[i].masks);
+		cp->pri_to_cpu[i].masks = NULL;
+	}
+}
+
+static inline int
+available_cpu_in_nodes(struct task_struct *p, struct cpupri_vec *vec)
+{
+	int cur_node = numa_node_id();
+
+	for (int i = 0; i < nr_node_ids; i++) {
+		int nid = (cur_node + i) % nr_node_ids;
+
+		if (cpumask_first_and_and(&p->cpus_mask, vec->masks[nid],
+					cpumask_of_node(nid)) < nr_cpu_ids)
+			return 1;
+	}
+
+	return 0;
+}
+
+#define available_cpu_in_vec available_cpu_in_nodes
+
+#else /* !CONFIG_CPUMASK_OFFSTACK */
+
+static inline int alloc_vec_masks(struct cpupri_vec *vec)
+{
+	if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static inline void free_vec_masks(struct cpupri_vec *vec)
+{
+	free_cpumask_var(vec->mask);
+}
+
+static inline int setup_vec_mask_var_ts(struct cpupri *cp)
+{
+	return 0;
+}
+
+static inline void free_vec_mask_var_ts(struct cpupri *cp)
+{
+}
+
+static inline int
+available_cpu_in_vec(struct task_struct *p, struct cpupri_vec *vec)
+{
+	if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+		return 0;
+
+	return 1;
+}
+#endif
+
+static inline int alloc_all_masks(struct cpupri *cp)
+{
+	int i;
+
+	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+		if (alloc_vec_masks(&cp->pri_to_cpu[i]))
+			goto cleanup;
+	}
+
+	return 0;
+
+cleanup:
+	while (i--)
+		free_vec_masks(&cp->pri_to_cpu[i]);
+
+	return -ENOMEM;
+}
+
+static inline void setup_vec_counts(struct cpupri *cp)
+{
+	for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+		struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+		atomic_set(&vec->count, 0);
+	}
+}
+
 static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 				struct cpumask *lowest_mask, int idx)
 {
@@ -96,11 +233,24 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 	if (skip)
 		return 0;
 
-	if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+	if (!available_cpu_in_vec(p, vec))
 		return 0;
 
+#ifdef	CONFIG_CPUMASK_OFFSTACK
+	struct cpumask *cpupri_mask = lowest_mask;
+
+	// available && lowest_mask
+	if (lowest_mask) {
+		cpumask_copy(cpupri_mask, vec->masks[0]);
+		for (int nid = 1; nid < nr_node_ids; nid++)
+			cpumask_and(cpupri_mask, cpupri_mask, vec->masks[nid]);
+	}
+#else
+	struct cpumask *cpupri_mask = vec->mask;
+#endif
+
 	if (lowest_mask) {
-		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+		cpumask_and(lowest_mask, &p->cpus_mask, cpupri_mask);
 		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
 
 		/*
@@ -229,7 +379,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 	if (likely(newpri != CPUPRI_INVALID)) {
 		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
 
+#ifdef	CONFIG_CPUMASK_OFFSTACK
+		cpumask_set_cpu(cpu, vec->masks[cpu_to_node(cpu)]);
+#else
 		cpumask_set_cpu(cpu, vec->mask);
+#endif
 		/*
 		 * When adding a new vector, we update the mask first,
 		 * do a write memory barrier, and then update the count, to
@@ -263,7 +417,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 		 */
 		atomic_dec(&(vec)->count);
 		smp_mb__after_atomic();
+#ifdef	CONFIG_CPUMASK_OFFSTACK
+		cpumask_clear_cpu(cpu, vec->masks[cpu_to_node(cpu)]);
+#else
 		cpumask_clear_cpu(cpu, vec->mask);
+#endif
 	}
 
 	*currpri = newpri;
@@ -279,26 +437,31 @@ int cpupri_init(struct cpupri *cp)
 {
 	int i;
 
-	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
-		struct cpupri_vec *vec = &cp->pri_to_cpu[i];
-
-		atomic_set(&vec->count, 0);
-		if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
-			goto cleanup;
-	}
-
+	/* Allocate the cpu_to_pri array */
 	cp->cpu_to_pri = kcalloc(nr_cpu_ids, sizeof(int), GFP_KERNEL);
 	if (!cp->cpu_to_pri)
-		goto cleanup;
+		return -ENOMEM;
 
+	/* Initialize all CPUs to invalid priority */
 	for_each_possible_cpu(i)
 		cp->cpu_to_pri[i] = CPUPRI_INVALID;
 
+	/* Setup priority vectors */
+	setup_vec_counts(cp);
+	if (setup_vec_mask_var_ts(cp))
+		goto fail_setup_vectors;
+
+	/* Allocate masks for each priority vector */
+	if (alloc_all_masks(cp))
+		goto fail_alloc_masks;
+
 	return 0;
 
-cleanup:
-	for (i--; i >= 0; i--)
-		free_cpumask_var(cp->pri_to_cpu[i].mask);
+fail_alloc_masks:
+	free_vec_mask_var_ts(cp);
+
+fail_setup_vectors:
+	kfree(cp->cpu_to_pri);
 	return -ENOMEM;
 }
 
@@ -308,9 +471,10 @@ int cpupri_init(struct cpupri *cp)
  */
 void cpupri_cleanup(struct cpupri *cp)
 {
-	int i;
-
 	kfree(cp->cpu_to_pri);
-	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++)
-		free_cpumask_var(cp->pri_to_cpu[i].mask);
+
+	for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++)
+		free_vec_masks(&cp->pri_to_cpu[i]);
+
+	free_vec_mask_var_ts(cp);
 }
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index 245b0fa626be..c53f1f4dad86 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,11 @@
 
 struct cpupri_vec {
 	atomic_t		count;
+#ifdef CONFIG_CPUMASK_OFFSTACK
+	cpumask_var_t		*masks	____cacheline_aligned;
+#else
 	cpumask_var_t		mask	____cacheline_aligned;
+#endif
 };
 
 struct cpupri {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention
  2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
                   ` (3 preceding siblings ...)
  2025-07-21  6:10 ` [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
@ 2026-03-20  9:59 ` Peter Zijlstra
  2026-03-20 12:50   ` Peter Zijlstra
  4 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20  9:59 UTC (permalink / raw)
  To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Mon, Jul 21, 2025 at 02:10:22PM +0800, Pan Deng wrote:
> When running multi-instance FFmpeg workload in cloud environment,
> cache line contention is severe during the access to root_domain data
> structures, which significantly degrades performance.
> 
> The SUT is a 2-socket machine with 240 physical cores and 480 logical

What's a SUT?

> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS(frame per second) is used as score.

So I think we can do some of this, but that workload is hilariously
poorly configured.

You're pinning things but not partitioning, why?  If you would have
created 60 partitions, one for each FFmpeg thingy, then you wouldn't
have needed any of this.

You're running at FIFO99 (IOW prio-0) and then claiming prio-0 is used
more heavily than others... will d0h.  What priority assignment scheme
led to this? Is there a sensible reason these must be 99?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2025-07-21  6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
@ 2026-03-20 10:09   ` Peter Zijlstra
  2026-03-24  9:36     ` Deng, Pan
  2026-04-08 10:16   ` Chen, Yu C
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20 10:09 UTC (permalink / raw)
  To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Mon, Jul 21, 2025 at 02:10:23PM +0800, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on an HCC system, significant
> cache line contention is observed around `cpupri_vec->count` and `mask` in
> struct root_domain.
> 
> The SUT is a 2-socket machine with 240 physical cores and 480 logical
> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS is used as score.
> 
> perf c2c tool reveals:
> root_domain cache line 3:
> - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
>    and contends with other fields, since counts[0] is more frequently
>    updated than others along with a rt task enqueues an empty runq or
>    dequeues from a non-overloaded runq.
> - cycles per load: ~10K to 59K
> 
> cpupri's last cache line:
> - `cpupri_vec->count` and `mask` contends. The transcoding threads use
>   rt pri 99, so that the contention occurs in the end.
> - cycles per load: ~1.5K to 10.5K
> 
> This change mitigates `cpupri_vec->count`, `mask` related contentions by
> separating each count and mask into different cache lines.

Right.

> Note: The side effect of this change is that struct cpupri size is
> increased from 26 cache lines to 203 cache lines.

That is pretty horrible, but probably unavoidable.

> An alternative implementation of this patch could be separating `counts`
> and `masks` into 2 vectors in cpupri_vec (counts[] and masks[]), and
> add two paddings:
> 1. Between counts[0] and counts[1], since counts[0] is more frequently
>    updated than others.

That is completely workload specific; it is a direct consequence of your
(probably busted) priority assignment scheme.

> 2. Between the two vectors, since counts[] is read-write access  while
>    masks[] is read access when it stores pointers.
> 
> The alternative introduces the complexity of 31+/21- LoC changes,
> it achieves almost the same performance, at the same time, struct cpupri
> size is reduced from 26 cache lines to 21 cache lines.

That is not an alternative, since it very specifically only deals with
fifo-99 contention.

> ---
>  kernel/sched/cpupri.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..245b0fa626be 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -9,7 +9,7 @@
>  
>  struct cpupri_vec {
>  	atomic_t		count;
> -	cpumask_var_t		mask;
> +	cpumask_var_t		mask	____cacheline_aligned;
>  };

At the very least this needs a comment, explaining the what and how of
it.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention
  2025-07-21  6:10 ` [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
@ 2026-03-20 10:18   ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20 10:18 UTC (permalink / raw)
  To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Mon, Jul 21, 2025 at 02:10:24PM +0800, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on HCC system, significant
> contention is observed in root_domain cacheline 1 and 3.

What's a HCC? Hobby Computer Club? Google is telling me it is the most
prevalent form of liver cancer, but I somehow doubt that is what you're
on about.

> The SUT is a 2-socket machine with 240 physical cores and 480 logical

Satellite User Terminal? Subsea Umbilical Termination? Small Unit
Transceiver? Single Unit Test?

> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS is used as score.

Yes yes, poorly configured systems hurt.

> perf c2c tool reveals (sorted by contention severity):
> root_domain cache line 3:
> - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored,
>    since counts[0] is more frequently updated than others along with a
>    rt task enqueues an empty runq or dequeues from a non-overloaded runq.
> - `rto_mask` (0x30) is heavily loaded
> - `rto_loop_next` (0x24) and `rto_loop_start` (0x28) are frequently stored
> - `rto_push_work` (0x0) and `rto_lock` (0x18) are lightly accessed
> - cycles per load: ~10K to 59K
> 
> root_domain cache line 1:
> - `rto_count` (0x4) is frequently loaded/stored
> - `overloaded` (0x28) is heavily loaded
> - cycles per load: ~2.8K to 44K:
> 
> This change adjusts the layout of `root_domain` to isolate these contended
> fields across separate cache lines:
> 1. `rto_count` remains in the 1st cache line; `overloaded` and
>    `overutilized` are moved to the last cache line
> 2. `rto_push_work` is placed in the 2nd cache line
> 3. `rto_loop_start`, `rto_loop_next`, and `rto_lock` remain in the 3rd
>    cache line; `rto_mask` is moved near `pd` in the penultimate cache line
> 4. `cpupri` starts at the 4th cache line to prevent `pri_to_cpu[0].count`
>    contending with fields in cache line 3.
> 
> With this change:
> - FPS improves by ~5%
> - Kernel cycles% drops from ~20% to ~17.7%
> - root_domain cache line 3 no longer appears in perf-c2c report
> - cycles per load of root_domain cache line 1 is reduced to from
>   ~2.8K-44K to ~2.1K-2.7K
> - stress-ng cyclic benchmark is improved ~18.6%, command:
>   stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo   \
>                       --timeout 30 --minimize --metrics
> - rt-tests/pi_stress is improved ~4.7%, command:
>   rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
> 
> According to the nature of the change, to my understanding, it doesn`t
> introduce any negative impact in other scenario.
> 
> Note: This change increases the size of `root_domain` from 29 to 31 cache
> lines, it's considered acceptable since `root_domain` is a single global
> object.

Uhm, what? We're at 207 cachelines due to that previous patch, remember?
A few more don't matter at this point I would guess.

It doesn't actually apply anymore, but it needs the very same that
previous patch did -- more comments.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
  2025-07-21  6:10 ` [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
@ 2026-03-20 10:24   ` Peter Zijlstra
  2026-03-23 18:09     ` Tim Chen
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20 10:24 UTC (permalink / raw)
  To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Mon, Jul 21, 2025 at 02:10:25PM +0800, Pan Deng wrote:
> As a complementary, this patch splits
> `rto_count` into per-numa-node counters to reduce the contention.

Right... so Tim, didn't we have similar patches for task_group::load_avg
or something like that? Whatever did happen there? Can we share common
infra?

Also since Tim is sitting on this LLC infrastructure, can you compare
per-node and per-llc for this stuff? Somehow I'm thinking that a 2
socket 480 CPU system only has like 2 nodes and while splitting this
will help some, that might not be excellent.

Please test on both Intel and AMD systems, since AMD has more of these
LLC things on.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2025-07-21  6:10 ` [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
@ 2026-03-20 12:40   ` Peter Zijlstra
  2026-03-23 18:45     ` Tim Chen
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20 12:40 UTC (permalink / raw)
  To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Mon, Jul 21, 2025 at 02:10:26PM +0800, Pan Deng wrote:

> This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> mitigate false sharing.

So I really do think we need something here. We're running into the
whole cpumask contention thing on a semi regular basis.

But somehow I doubt this is it.

I would suggest building a radix tree like structure based on ACPIID
-- which is inherently suitable for this given that is exactly how
CPUID-0b/1f are specified.

This of course makes it very much x86 specific, but perhaps other
architectures can provide similarly structured id spaces suitable for
this.

If you make it so that it reduces to a single large level (equivalent to
the normal bitmaps) when no intermediate masks are specific, it should
work for all, and then architectures can opt-in by providing a suitable
id space and masks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention
  2026-03-20  9:59 ` [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Peter Zijlstra
@ 2026-03-20 12:50   ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20 12:50 UTC (permalink / raw)
  To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Fri, Mar 20, 2026 at 10:59:55AM +0100, Peter Zijlstra wrote:
> On Mon, Jul 21, 2025 at 02:10:22PM +0800, Pan Deng wrote:
> > When running multi-instance FFmpeg workload in cloud environment,
> > cache line contention is severe during the access to root_domain data
> > structures, which significantly degrades performance.
> > 
> > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> 
> What's a SUT?
> 
> > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > with FIFO scheduling. FPS(frame per second) is used as score.
> 
> So I think we can do some of this, but that workload is hilariously
> poorly configured.
> 
> You're pinning things but not partitioning, why?  If you would have
> created 60 partitions, one for each FFmpeg thingy, then you wouldn't
> have needed any of this.
> 
> You're running at FIFO99 (IOW prio-0) and then claiming prio-0 is used
> more heavily than others... will d0h.  What priority assignment scheme
> led to this? Is there a sensible reason these must be 99?
> 

Also, you failed the most basic of tasks, Cc all the relevant people. I
would've hoped at least some of the 'reviewer' you had would've told you
about that.

Notably, Steve is the one that often looks after this RT stuff.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
  2026-03-20 10:24   ` Peter Zijlstra
@ 2026-03-23 18:09     ` Tim Chen
  2026-03-24 12:16       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Tim Chen @ 2026-03-23 18:09 UTC (permalink / raw)
  To: Peter Zijlstra, Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, yu.c.chen

On Fri, 2026-03-20 at 11:24 +0100, Peter Zijlstra wrote:
> On Mon, Jul 21, 2025 at 02:10:25PM +0800, Pan Deng wrote:
> > As a complementary, this patch splits
> > `rto_count` into per-numa-node counters to reduce the contention.
> 
> Right... so Tim, didn't we have similar patches for task_group::load_avg
> or something like that? Whatever did happen there? Can we share common
> infra?

We did talk about introducing per NUMA counter for load_avg. We went with
limiting the update rate of load_avg to not more than once per msec
in commit 1528c661c24b4 to control the cache bounce.
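The general shape of that fix was simply rate-limiting the write side;
roughly (simplified, not the literal code from that commit):

static void maybe_update_shared(atomic_long_t *shared, long delta,
				u64 *last_update, u64 now)
{
	/* don't touch the shared (contended) line more than once per ms */
	if (now - *last_update < NSEC_PER_MSEC)
		return;

	atomic_long_add(delta, shared);
	*last_update = now;
}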

> 
> Also since Tim is sitting on this LLC infrastructure, can you compare
> per-node and per-llc for this stuff? Somehow I'm thinking that a 2
> socket 480 CPU system only has like 2 nodes and while splitting this
> will help some, that might not be excellent.

You mean enhancing the per NUMA counter to per LLC? I think that makes
sense to reduce the LLC cache bounce if there are multiple LLCs per
NUMA node.

Tim

> 
> Please test on both Intel and AMD systems, since AMD has more of these
> LLC things on.
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-03-20 12:40   ` Peter Zijlstra
@ 2026-03-23 18:45     ` Tim Chen
  2026-03-24 12:00       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Tim Chen @ 2026-03-23 18:45 UTC (permalink / raw)
  To: Peter Zijlstra, Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, yu.c.chen

On Fri, 2026-03-20 at 13:40 +0100, Peter Zijlstra wrote:
> On Mon, Jul 21, 2025 at 02:10:26PM +0800, Pan Deng wrote:
> 
> > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> > mitigate false sharing.
> 
> So I really do think we need something here. We're running into the
> whole cpumask contention thing on a semi regular basis.
> 
> But somehow I doubt this is it.
> 
> I would suggest building a radix tree like structure based on ACPIID
> -- which is inherently suitable for this given that is exactly how
> CPUID-0b/1f are specified.
> 

Are you thinking about replacing cpumask in cpupri_vec with something like xarray?
And a question on using ACPIID for the CPU as index instead of CPUID. 
Is it because you want to even out access in the tree?

Tim

> This of course makes it very much x86 specific, but perhaps other
> architectures can provide similarly structured id spaces suitable for
> this.
> 
> If you make it so that it reduces to a single large level (equivalent to
> the normal bitmaps) when no intermediate masks are specified, it should
> work for all, and then architectures can opt-in by providing a suitable
> id space and masks.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2026-03-20 10:09   ` Peter Zijlstra
@ 2026-03-24  9:36     ` Deng, Pan
  2026-03-24 12:11       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Deng, Pan @ 2026-03-24  9:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo@kernel.org, rostedt@goodmis.org,
	linux-kernel@vger.kernel.org, Li, Tianyou,
	tim.c.chen@linux.intel.com, Chen, Yu C

> On Mon, Jul 21, 2025 at 02:10:23PM +0800, Pan Deng wrote:
> > When running a multi-instance FFmpeg workload on an HCC system,
> significant
> > cache line contention is observed around `cpupri_vec->count` and `mask` in
> > struct root_domain.
> >
> > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > with FIFO scheduling. FPS is used as score.
> >
> > perf c2c tool reveals:
> > root_domain cache line 3:
> > - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
> >    and contends with other fields, since counts[0] is more frequently
> >    updated than others along with a rt task enqueues an empty runq or
> >    dequeues from a non-overloaded runq.
> > - cycles per load: ~10K to 59K
> >
> > cpupri's last cache line:
> > - `cpupri_vec->count` and `mask` contends. The transcoding threads use
> >   rt pri 99, so that the contention occurs in the end.
> > - cycles per load: ~1.5K to 10.5K
> >
> > This change mitigates `cpupri_vec->count`, `mask` related contentions by
> > separating each count and mask into different cache lines.
> 
> Right.
> 
> > Note: The side effect of this change is that struct cpupri size is
> > increased from 26 cache lines to 203 cache lines.
> 
> That is pretty horrible, but probably unavoidable.
> 
> > An alternative implementation of this patch could be separating `counts`
> > and `masks` into 2 vectors in cpupri_vec (counts[] and masks[]), and
> > add two paddings:
> > 1. Between counts[0] and counts[1], since counts[0] is more frequently
> >    updated than others.
> 
> That is completely workload specific; it is a direct consequence of your
> (probably busted) priority assignment scheme.
> 
> > 2. Between the two vectors, since counts[] is read-write access  while
> >    masks[] is read access when it stores pointers.
> >
> > The alternative introduces the complexity of 31+/21- LoC changes;
> > it achieves almost the same performance, while at the same time struct
> > cpupri size is reduced from 26 cache lines to 21 cache lines.
> 
> That is not an alternative, since it very specifically only deals with
> fifo-99 contention.
> 
> > ---
> >  kernel/sched/cpupri.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> > index d6cba0020064..245b0fa626be 100644
> > --- a/kernel/sched/cpupri.h
> > +++ b/kernel/sched/cpupri.h
> > @@ -9,7 +9,7 @@
> >
> >  struct cpupri_vec {
> >  	atomic_t		count;
> > -	cpumask_var_t		mask;
> > +	cpumask_var_t		mask	____cacheline_aligned;
> >  };
> 
> At the very least this needs a comment, explaining the what and how of
> it.

Hi Peter,

Thank you very much for taking the time to look at this patch series.
Before digging into the details, let me briefly describe how the patch
set is structured.
Each patch builds incrementally on the previous ones: patch 1 improves
performance by 11%, patches 1+2 by 12%, patches 1+2+3 by 13%, and
patches 1+2+3+4 by 16%.
Since patch 1 gives the most benefit and is simple enough, we are
planning to address the first issue with patch 1 and try to get it in
first, then address your comments in the remaining patches.
We'll investigate a more generic method to solve the global contention
issue, as you proposed for patch 3 and patch 4, and we plan to evaluate
that on multi-LLC systems as well (Intel and AMD).
Regarding this patch: yes, the cacheline alignment does increase the
potential memory usage.
After internal discussion, we are considering an alternative that would
mitigate the memory overhead: use kmalloc() to allocate the count in a
separate memory area instead of placing the count and cpumask together
in this structure. The rationale is that writes through the counter
pointer and reads of the cpumask would then hit different memory,
reducing the amount of false sharing; in addition, the slab/slub
allocator can place the objects in different cache lines, which reduces
the cache contention.
The drawback of dynamically allocated counters is that we have to
manage their life cycle.
Could you please advise whether sticking with the current
____cacheline_aligned approach or switching to kmalloc() is preferred?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-03-23 18:45     ` Tim Chen
@ 2026-03-24 12:00       ` Peter Zijlstra
  2026-03-31  5:37         ` Chen, Yu C
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-24 12:00 UTC (permalink / raw)
  To: Tim Chen
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li, yu.c.chen,
	kprateek.nayak

On Mon, Mar 23, 2026 at 11:45:01AM -0700, Tim Chen wrote:
> On Fri, 2026-03-20 at 13:40 +0100, Peter Zijlstra wrote:
> > On Mon, Jul 21, 2025 at 02:10:26PM +0800, Pan Deng wrote:
> > 
> > > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> > > mitigate false sharing.
> > 
> > So I really do think we need something here. We're running into the
> > whole cpumask contention thing on a semi regular basis.
> > 
> > But somehow I doubt this is it.
> > 
> > I would suggest building a radix tree like structure based on ACPIID
> > -- which is inherently suitable for this given that is exactly how
> > CPUID-0b/1f are specified.
> > 
> 
> Are you thinking about replacing cpumask in cpupri_vec with something like xarray?
> And a question on using ACPIID for the CPU as index instead of CPUID. 
> Is it because you want to even out access in the tree?

Sorry, s/ACPI/APIC/, I keep cursing the sadist that put those two
acronyms together.

No, because I want it sorted by topology. Per virtue of CPUID-b/1f the
APIC-ID is in topology order.

Perhaps a little something like this. It will obviously only build on
x86, and then only boot for those that have <=64 CPUs in their DIE
domain.

This needs ARCH_HAS_SBM and a cpumask based fallback implementation of
sbm at the very least.

Now, I was hoping AMD EPYC would have their CCD things as a topology
level, but going by the MADT/CPUID dump I got from Boris, this is not
the case. So we need to manually insert that level and hope the APIC-ID
range is nicely setup for that, or they need to do worse things still.

I think that for things like DMR the DIE level DTRT, but I've not yet
seen one up close :/

Also, I really wish all the SNC capable chips would have the SNC domains
enumerated, even if SNC is not in use. This x86 topology enumeration
stuff is such a shit show :-(

Anyway, random hackery below, it basically does a 2 level structure
where the leaf is a whole cacheline (double check the kzalloc_obj()
stuff respects alignment) and we make sure the CPUs for that leaf are
actually from the same cache domain. At least, that's the theory, see
above ranting on the glories of topology enumeration.

It boots in qemu with --cpus 16,sockets=2,dies=2 and appears to 'work'.
YMMV

This code is very much a PoC, treat it as such.

---
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 9cd493d467d4..24012a91ac1e 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -54,6 +54,7 @@ static inline void x86_32_probe_apic(void) { }
 #endif
 
 extern u32 cpuid_to_apicid[];
+extern u32 apicid_to_cpuid[];
 
 #define CPU_ACPIID_INVALID	U32_MAX
 
diff --git a/arch/x86/include/asm/sbm.h b/arch/x86/include/asm/sbm.h
new file mode 100644
index 000000000000..9a4d283347d1
--- /dev/null
+++ b/arch/x86/include/asm/sbm.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/apic.h>
+
+static __always_inline u32 arch_sbm_cpu_to_idx(unsigned int cpu)
+{
+	return cpuid_to_apicid[cpu];
+}
+
+static __always_inline u32 arch_sbm_idx_to_cpu(unsigned int idx)
+{
+	return apicid_to_cpuid[idx];
+}
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
index eafcb1fc185a..6f3d18288600 100644
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -48,6 +48,12 @@ DECLARE_BITMAP(phys_cpu_present_map, MAX_LOCAL_APIC) __read_mostly;
 
 /* Used for CPU number allocation and parallel CPU bringup */
 u32 cpuid_to_apicid[] __ro_after_init = { [0 ... NR_CPUS - 1] = BAD_APICID, };
+u32 apicid_to_cpuid[MAX_LOCAL_APIC] = { 0 };
+
+u32 arch_sbm_leafs	__ro_after_init;
+u32 arch_sbm_shift	__ro_after_init;
+u32 arch_sbm_mask	__ro_after_init;
+u32 arch_sbm_bits	__ro_after_init;
 
 /* Bitmaps to mark registered APICs at each topology domain */
 static struct { DECLARE_BITMAP(map, MAX_LOCAL_APIC); } apic_maps[TOPO_MAX_DOMAIN] __ro_after_init;
@@ -234,6 +240,7 @@ static __init void topo_register_apic(u32 apic_id, u32 acpi_id, bool present)
 			cpu = topo_get_cpunr(apic_id);
 
 		cpuid_to_apicid[cpu] = apic_id;
+		apicid_to_cpuid[apic_id] = cpu;
 		topo_set_cpuids(cpu, apic_id, acpi_id);
 	} else {
 		topo_info.nr_disabled_cpus++;
@@ -537,7 +544,9 @@ void __init topology_init_possible_cpus(void)
 					      MAX_LOCAL_APIC, apicid);
 		if (apicid >= MAX_LOCAL_APIC)
 			break;
-		cpuid_to_apicid[topo_info.nr_assigned_cpus++] = apicid;
+		cpu = topo_info.nr_assigned_cpus++;
+		cpuid_to_apicid[cpu] = apicid;
+		apicid_to_cpuid[apicid] = cpu;
 	}
 
 	for (cpu = 0; cpu < allowed; cpu++) {
@@ -551,6 +560,17 @@ void __init topology_init_possible_cpus(void)
 		cpu_mark_primary_thread(cpu, apicid);
 		set_cpu_present(cpu, test_bit(apicid, phys_cpu_present_map));
 	}
+
+	apicid = 0;
+	for_each_possible_cpu(cpu)
+		apicid = max(apicid, cpuid_to_apicid[cpu]);
+
+	arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
+	arch_sbm_leafs = 1 + (apicid >> arch_sbm_shift);
+	arch_sbm_mask = (1 << arch_sbm_shift) - 1;
+	arch_sbm_bits = arch_sbm_shift;
+
+	pr_info("SBM: shift(%d) leafs(%d) APIC(%x)\n", arch_sbm_shift, arch_sbm_leafs, apicid);
 }
 
 /*
diff --git a/include/linux/sbm.h b/include/linux/sbm.h
new file mode 100644
index 000000000000..8beade6c0585
--- /dev/null
+++ b/include/linux/sbm.h
@@ -0,0 +1,83 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SBM_H
+#define _LINUX_SBM_H
+
+#include <linux/slab.h>
+#include <linux/bitmap.h>
+#include <linux/cpumask.h>
+#include <asm/sbm.h>
+
+extern unsigned int arch_sbm_leafs;
+extern unsigned int arch_sbm_shift;
+extern unsigned int arch_sbm_mask;
+extern unsigned int arch_sbm_bits;
+
+extern unsigned int arch_sbm_cpu_to_idx(unsigned int cpu);
+extern unsigned int arch_sbm_idx_to_cpu(unsigned int idx);
+
+enum sbm_type {
+	st_root = 0,
+	st_leaf,
+};
+
+struct sbm_root {
+	enum sbm_type	type;
+	unsigned int	nr;
+	struct sbm_leaf *leafs[] __counted_by(nr);
+};
+
+struct sbm_leaf {
+	enum sbm_type	type;
+	unsigned long	bitmap;
+} ____cacheline_aligned;
+
+struct sbm {
+	enum sbm_type	type;
+};
+
+extern struct sbm *sbm_alloc(void);
+extern unsigned int sbm_find_next_bit(struct sbm *sbm, int start);
+
+#define __sbm_op(sbm, func)				\
+({							\
+	struct sbm_leaf *leaf = (void *)sbm;		\
+	int idx = arch_sbm_cpu_to_idx(cpu);		\
+	if (sbm->type == st_root) {			\
+		struct sbm_root *root = (void *)sbm;	\
+		int nr = idx >> arch_sbm_shift;		\
+		leaf = root->leafs[nr];			\
+	}						\
+	int bit = idx & arch_sbm_mask;			\
+	func(bit, &leaf->bitmap);			\
+})
+
+static inline void sbm_cpu_set(struct sbm *sbm, int cpu)
+{
+	__sbm_op(sbm, set_bit);
+}
+
+static inline void sbm_cpu_clear(struct sbm *sbm, int cpu)
+{
+	__sbm_op(sbm, clear_bit);
+}
+
+static inline void __sbm_cpu_set(struct sbm *sbm, int cpu)
+{
+	__sbm_op(sbm, __set_bit);
+}
+
+static inline void __sbm_cpu_clear(struct sbm *sbm, int cpu)
+{
+	__sbm_op(sbm, __clear_bit);
+}
+
+static inline bool sbm_cpu_test(struct sbm *sbm, int cpu)
+{
+	return __sbm_op(sbm, test_bit);
+}
+
+#define sbm_for_each_set_bit(sbm, idx) \
+	for (int idx = sbm_find_next_bit(sbm, 0); \
+	     idx >= 0; idx = sbm_find_next_bit(sbm, idx+1))
+
+#endif /* _LINUX_SBM_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 226509231e67..a3a423c4706e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -49,6 +49,7 @@
 #include <linux/ratelimit.h>
 #include <linux/task_work.h>
 #include <linux/rbtree_augmented.h>
+#include <linux/sbm.h>
 
 #include <asm/switch_to.h>
 
@@ -7384,7 +7385,7 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
 #ifdef CONFIG_NO_HZ_COMMON
 
 static struct {
-	cpumask_var_t idle_cpus_mask;
+	struct sbm *sbm;
 	int has_blocked_load;		/* Idle CPUS has blocked load */
 	int needs_update;		/* Newly idle CPUs need their next_balance collated */
 	unsigned long next_balance;     /* in jiffy units */
@@ -12615,12 +12616,11 @@ static inline int on_null_domain(struct rq *rq)
 static inline int find_new_ilb(void)
 {
 	int this_cpu = smp_processor_id();
-	const struct cpumask *hk_mask;
 	int ilb_cpu;
 
-	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
+	sbm_for_each_set_bit(nohz.sbm, idx) {
+		ilb_cpu = arch_sbm_idx_to_cpu(idx);
 
-	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
 		if (ilb_cpu == this_cpu)
 			continue;
 
@@ -12685,7 +12685,7 @@ static void nohz_balancer_kick(struct rq *rq)
 	unsigned long now = jiffies;
 	struct sched_domain_shared *sds;
 	struct sched_domain *sd;
-	int nr_busy, i, cpu = rq->cpu;
+	int nr_busy, cpu = rq->cpu;
 	unsigned int flags = 0;
 
 	if (unlikely(rq->idle_balance))
@@ -12713,13 +12713,6 @@ static void nohz_balancer_kick(struct rq *rq)
 	if (time_before(now, nohz.next_balance))
 		goto out;
 
-	/*
-	 * None are in tickless mode and hence no need for NOHZ idle load
-	 * balancing
-	 */
-	if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
-		return;
-
 	if (rq->nr_running >= 2) {
 		flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
 		goto out;
@@ -12739,24 +12732,6 @@ static void nohz_balancer_kick(struct rq *rq)
 		}
 	}
 
-	sd = rcu_dereference_all(per_cpu(sd_asym_packing, cpu));
-	if (sd) {
-		/*
-		 * When ASYM_PACKING; see if there's a more preferred CPU
-		 * currently idle; in which case, kick the ILB to move tasks
-		 * around.
-		 *
-		 * When balancing between cores, all the SMT siblings of the
-		 * preferred CPU must be idle.
-		 */
-		for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) {
-			if (sched_asym(sd, i, cpu)) {
-				flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK;
-				goto unlock;
-			}
-		}
-	}
-
 	sd = rcu_dereference_all(per_cpu(sd_asym_cpucapacity, cpu));
 	if (sd) {
 		/*
@@ -12829,7 +12804,8 @@ void nohz_balance_exit_idle(struct rq *rq)
 		return;
 
 	rq->nohz_tick_stopped = 0;
-	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
+	if (cpumask_test_cpu(rq->cpu, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)))
+		sbm_cpu_clear(nohz.sbm, rq->cpu);
 
 	set_cpu_sd_state_busy(rq->cpu);
 }
@@ -12886,7 +12862,8 @@ void nohz_balance_enter_idle(int cpu)
 
 	rq->nohz_tick_stopped = 1;
 
-	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
+	if (cpumask_test_cpu(rq->cpu, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)))
+		sbm_cpu_set(nohz.sbm, rq->cpu);
 
 	/*
 	 * Ensures that if nohz_idle_balance() fails to observe our
@@ -12913,7 +12890,7 @@ static bool update_nohz_stats(struct rq *rq)
 	if (!rq->has_blocked_load)
 		return false;
 
-	if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
+	if (!sbm_cpu_test(nohz.sbm, cpu))
 		return false;
 
 	if (!time_after(jiffies, READ_ONCE(rq->last_blocked_load_update_tick)))
@@ -12967,7 +12944,9 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 	 * Start with the next CPU after this_cpu so we will end with this_cpu and let a
 	 * chance for other idle cpu to pull load.
 	 */
-	for_each_cpu_wrap(balance_cpu,  nohz.idle_cpus_mask, this_cpu+1) {
+	sbm_for_each_set_bit(nohz.sbm, idx) {
+		balance_cpu = arch_sbm_idx_to_cpu(idx);
+
 		if (!idle_cpu(balance_cpu))
 			continue;
 
@@ -14250,6 +14229,6 @@ __init void init_sched_fair_class(void)
 #ifdef CONFIG_NO_HZ_COMMON
 	nohz.next_balance = jiffies;
 	nohz.next_blocked = jiffies;
-	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
+	nohz.sbm = sbm_alloc();
 #endif
 }
diff --git a/lib/Makefile b/lib/Makefile
index 1b9ee167517f..8d1f6b5327d5 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -40,7 +40,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
 	 earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
 	 nmi_backtrace.o win_minmax.o memcat_p.o \
-	 buildid.o objpool.o iomem_copy.o sys_info.o
+	 buildid.o objpool.o iomem_copy.o sys_info.o sbm.o
 
 lib-$(CONFIG_UNION_FIND) += union_find.o
 lib-$(CONFIG_PRINTK) += dump_stack.o
diff --git a/lib/sbm.c b/lib/sbm.c
new file mode 100644
index 000000000000..167cf857cd32
--- /dev/null
+++ b/lib/sbm.c
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/sbm.h>
+
+struct sbm *sbm_alloc(void)
+{
+	unsigned int nr = arch_sbm_leafs;
+	struct sbm_root *root = kzalloc_flex(*root, leafs, nr);
+	struct sbm_leaf *leaf;
+	if (!root)
+		return NULL;
+
+	root->type = st_root;
+
+	for (int i = 0; i < nr; i++) {
+		leaf = kzalloc_obj(*leaf);
+		if (!leaf)
+			goto fail;
+		leaf->type = st_leaf;
+		root->leafs[i] = leaf;
+	}
+
+	if (nr == 1) {
+		leaf = root->leafs[0];
+		kfree(root);
+		return (void *)leaf;
+	}
+
+	return (void *)root;
+
+fail:
+	for (int i = 0; i < nr; i++)
+		kfree(root->leafs[i]);
+	kfree(root);
+	return NULL;
+}
+
+unsigned int sbm_find_next_bit(struct sbm *sbm, int start)
+{
+	struct sbm_leaf *leaf = (void *)sbm;
+	struct sbm_root *root = (void *)sbm;
+	int nr = start >> arch_sbm_shift;
+	int bit = start & arch_sbm_mask;
+	unsigned long tmp, mask = (~0UL) << bit;
+	if (sbm->type == st_root) {
+		for (; nr < arch_sbm_leafs; nr++, mask = ~0UL) {
+			leaf = root->leafs[nr];
+			tmp = leaf->bitmap & mask;
+			if (!tmp)
+				continue;
+		}
+	} else {
+		tmp = leaf->bitmap & mask;
+	}
+	if (!tmp)
+		return -1;
+	return (nr << arch_sbm_shift) | __ffs(tmp);
+}

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2026-03-24  9:36     ` Deng, Pan
@ 2026-03-24 12:11       ` Peter Zijlstra
  2026-03-27 10:17         ` Deng, Pan
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-24 12:11 UTC (permalink / raw)
  To: Deng, Pan
  Cc: mingo@kernel.org, rostedt@goodmis.org,
	linux-kernel@vger.kernel.org, Li, Tianyou,
	tim.c.chen@linux.intel.com, Chen, Yu C

On Tue, Mar 24, 2026 at 09:36:14AM +0000, Deng, Pan wrote:

> Regarding this patch, yes, using cacheline aligned could increase potential
> memory usage.
> After internal discussion, we are thinking of an alternative method to
> mitigate the waste of memory usage, that is, using kmalloc() to allocate
> count in a different memory space rather than placing the count and
> cpumask together in this structure. The rationale is that, writing to
> address pointed by the counter and reading the address from cpumask
> is isolated in different memory space which could reduce the ratio of
> cache false sharing, besides, kmalloc() based on slub/slab could place
> the objects in different cache lines to reduce the cache contention.
> The drawback of dynamic allocation counter is that, we have to maintain
> the life cycle of the counters.
> Could you please advise if sticking with current cache_align attribute
> method or using kmalloc() is preferred?

Well, you'd have to allocate a full cacheline anyway. If you allocate N
4-byte (counter) objects, there's a fair chance they end up in the same
cacheline (it's a SLAB after all) and then you're back to having a ton of
false sharing.
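To actually get isolation you'd end up paying a full line per counter
anyway, e.g. a dedicated slab with cacheline alignment (sketch only, not
something I'm suggesting you do):

	struct kmem_cache *count_cache =
		kmem_cache_create("cpupri_count", sizeof(atomic_t),
				  cache_line_size(), 0, NULL);
	atomic_t *count = kmem_cache_zalloc(count_cache, GFP_KERNEL);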

Anyway, for your specific workload, why isn't partitioning a viable
solution? It would not need any kernel modifications and would get rid
of the contention entirely.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
  2026-03-23 18:09     ` Tim Chen
@ 2026-03-24 12:16       ` Peter Zijlstra
  2026-03-24 22:40         ` Tim Chen
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-24 12:16 UTC (permalink / raw)
  To: Tim Chen; +Cc: Pan Deng, mingo, linux-kernel, tianyou.li, yu.c.chen, x86

On Mon, Mar 23, 2026 at 11:09:24AM -0700, Tim Chen wrote:
> On Fri, 2026-03-20 at 11:24 +0100, Peter Zijlstra wrote:
> > On Mon, Jul 21, 2025 at 02:10:25PM +0800, Pan Deng wrote:
> > > As a complement, this patch splits
> > > `rto_count` into per-numa-node counters to reduce the contention.
> > 
> > Right... so Tim, didn't we have similar patches for task_group::load_avg
> > or something like that? Whatever did happen there? Can we share common
> > infra?
> 
> We did talk about introducing per NUMA counter for load_avg. We went with
> limiting the update rate of load_avg to not more than once per msec
> in commit 1528c661c24b4 to control the cache bounce.
> 
> > 
> > Also since Tim is sitting on this LLC infrastructure, can you compare
> > per-node and per-llc for this stuff? Somehow I'm thinking that a 2
> > socket 480 CPU system only has like 2 nodes and while splitting this
> > will help some, that might not be excellent.
> 
> You mean enhancing the per NUMA counter to per LLC? I think that makes
> sense to reduce the LLC cache bounce if there are multiple LLCs per
> NUMA node.

Does that system have multiple LLCs? Realistically, it would probably
improve things if we could split these giant stupid LLCs along the same
lines SNC does.

I still have the below terrible hack that I've been using to diagnose
and test all these multi-llc patches/regressions etc. Funnily enough it's
been good enough to actually show some of the issues.



---
Subject: x86/topology: Add parameter to split LLC
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Feb 19 12:11:16 CET 2026

Add a (debug) option to virtually split the LLC, no CAT involved, just fake
topology. Used to test code that depends (either in behaviour or directly) on
there being multiple LLC domains in a node.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 Documentation/admin-guide/kernel-parameters.txt |   12 ++++++++++++
 arch/x86/include/asm/processor.h                |    5 +++++
 arch/x86/kernel/smpboot.c                       |   20 ++++++++++++++++++++
 3 files changed, 37 insertions(+)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7241,6 +7241,18 @@ Kernel parameters
 			Not specifying this option is equivalent to
 			spec_store_bypass_disable=auto.
 
+	split_llc=
+			[X86,EARLY] Split the LLC N-ways
+
+			When set, the LLC is split this many ways by matching
+			'core_id % n'. This is setup before SMP bringup and
+			used during SMP bringup before it knows the full
+			topology. If your core count doesn't nicely divide by
+			the number given, you get to keep the pieces.
+
+			This is mostly a debug feature to emulate multiple LLCs
+			on hardware that only have a single LLC.
+
 	split_lock_detect=
 			[X86] Enable split lock detection or bus lock detection
 
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -699,6 +699,11 @@ static inline u32 per_cpu_l2c_id(unsigne
 	return per_cpu(cpu_info.topo.l2c_id, cpu);
 }
 
+static inline u32 per_cpu_core_id(unsigned int cpu)
+{
+	return per_cpu(cpu_info.topo.core_id, cpu);
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 /*
  * Issue a DIV 0/1 insn to clear any division data from previous DIV
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -424,6 +424,21 @@ static const struct x86_cpu_id intel_cod
 	{}
 };
 
+/*
+ * Allows splitting the LLC by matching 'core_id % split_llc'.
+ *
+ * This is mostly a debug hack to emulate systems with multiple LLCs per node
+ * on systems that do not naturally have this.
+ */
+static unsigned int split_llc = 0;
+
+static int __init split_llc_setup(char *str)
+{
+	get_option(&str, &split_llc);
+	return 0;
+}
+early_param("split_llc", split_llc_setup);
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
@@ -438,6 +453,11 @@ static bool match_llc(struct cpuinfo_x86
 	if (per_cpu_llc_id(cpu1) != per_cpu_llc_id(cpu2))
 		return false;
 
+	if (split_llc &&
+	    (per_cpu_core_id(cpu1) % split_llc) !=
+	    (per_cpu_core_id(cpu2) % split_llc))
+		return false;
+
 	/*
 	 * Allow the SNC topology without warning. Return of false
 	 * means 'c' does not share the LLC of 'o'. This will be

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
  2026-03-24 12:16       ` Peter Zijlstra
@ 2026-03-24 22:40         ` Tim Chen
  0 siblings, 0 replies; 41+ messages in thread
From: Tim Chen @ 2026-03-24 22:40 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Pan Deng, mingo, linux-kernel, tianyou.li, yu.c.chen, x86

On Tue, 2026-03-24 at 13:16 +0100, Peter Zijlstra wrote:
> On Mon, Mar 23, 2026 at 11:09:24AM -0700, Tim Chen wrote:
> > On Fri, 2026-03-20 at 11:24 +0100, Peter Zijlstra wrote:
> > > On Mon, Jul 21, 2025 at 02:10:25PM +0800, Pan Deng wrote:
> > > > As a complement, this patch splits
> > > > `rto_count` into per-numa-node counters to reduce the contention.
> > > 
> > > Right... so Tim, didn't we have similar patches for task_group::load_avg
> > > or something like that? Whatever did happen there? Can we share common
> > > infra?
> > 
> > We did talk about introducing per NUMA counter for load_avg. We went with
> > limiting the update rate of load_avg to not more than once per msec
> > in commit 1528c661c24b4 to control the cache bounce.
> > 
> > > 
> > > Also since Tim is sitting on this LLC infrastructure, can you compare
> > > per-node and per-llc for this stuff? Somehow I'm thinking that a 2
> > > socket 480 CPU system only has like 2 nodes and while splitting this
> > > will help some, that might not be excellent.
> > 
> > You mean enhancing the per NUMA counter to per LLC? I think that makes
> > sense to reduce the LLC cache bounce if there are multiple LLCs per
> > NUMA node.
> 
> Does that system have multiple LLCs? Realistically, it would probably
> improve things if we could split these giant stupid LLCs along the same
> lines SNC does.

The system that Pan tested does not have multiple LLCs per node. But
future Intel systems and current AMD systems do.  So it makes sense
to start thinking about a per-LLC counting infrastructure.

We could create a per-LLC counter library, kind of like the percpu counter
we already have. We can leverage the compact LLC id assignment in the cache
aware scheduling patches to allocate arrays indexed by LLC id.  The caveat
is that if such an LLC counter is used during early boot, before LLCs are
enumerated in the topology code, we may need to do the accounting in a
global count until the per-LLC counters get set up and we know the right
size of the LLC array.  And we'll also need to check whether LLCs come
online or go offline and handle that accordingly.

Does that sound reasonable?
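
Roughly what I have in mind (just a sketch; llc_id() and the compact id
assignment are placeholders for whatever the cache aware scheduling
series ends up exporting):

struct llc_counter {
	atomic_t		count;
} ____cacheline_aligned;

struct per_llc_count {
	atomic_t		early;		/* fallback until LLCs are enumerated */
	unsigned int		nr;		/* 0 until the topology is known */
	struct llc_counter	*per_llc;	/* array indexed by compact LLC id */
};

static inline void per_llc_count_add(struct per_llc_count *plc, int cpu, int v)
{
	/* llc_id(): placeholder for a compact LLC id lookup */
	if (likely(plc->nr))
		atomic_add(v, &plc->per_llc[llc_id(cpu)].count);
	else
		atomic_add(v, &plc->early);
}

static inline int per_llc_count_read(struct per_llc_count *plc)
{
	unsigned int i;
	int sum = atomic_read(&plc->early);

	for (i = 0; i < plc->nr; i++)
		sum += atomic_read(&plc->per_llc[i].count);
	return sum;
}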

Tim
> 
> I still have the below terrible hack that I've been using to diagnose
> and test all these multi-llc patches/regressions etc. Funnily enough it's
> been good enough to actually show some of the issues.
> 
> 
> 
> ---
> Subject: x86/topology: Add parameter to split LLC
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Thu Feb 19 12:11:16 CET 2026
> 
> Add a (debug) option to virtually split the LLC, no CAT involved, just fake
> topology. Used to test code that depends (either in behaviour or directly) on
> there being multiple LLC domains in a node.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   12 ++++++++++++
>  arch/x86/include/asm/processor.h                |    5 +++++
>  arch/x86/kernel/smpboot.c                       |   20 ++++++++++++++++++++
>  3 files changed, 37 insertions(+)
> 
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -7241,6 +7241,18 @@ Kernel parameters
>  			Not specifying this option is equivalent to
>  			spec_store_bypass_disable=auto.
>  
> +	split_llc=
> +			[X86,EARLY] Split the LLC N-ways
> +
> +			When set, the LLC is split this many ways by matching
> +			'core_id % n'. This is setup before SMP bringup and
> +			used during SMP bringup before it knows the full
> +			topology. If your core count doesn't nicely divide by
> +			the number given, you get to keep the pieces.
> +
> +			This is mostly a debug feature to emulate multiple LLCs
> +			on hardware that only have a single LLC.
> +
>  	split_lock_detect=
>  			[X86] Enable split lock detection or bus lock detection
>  
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -699,6 +699,11 @@ static inline u32 per_cpu_l2c_id(unsigne
>  	return per_cpu(cpu_info.topo.l2c_id, cpu);
>  }
>  
> +static inline u32 per_cpu_core_id(unsigned int cpu)
> +{
> +	return per_cpu(cpu_info.topo.core_id, cpu);
> +}
> +
>  #ifdef CONFIG_CPU_SUP_AMD
>  /*
>   * Issue a DIV 0/1 insn to clear any division data from previous DIV
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -424,6 +424,21 @@ static const struct x86_cpu_id intel_cod
>  	{}
>  };
>  
> +/*
> + * Allows splitting the LLC by matching 'core_id % split_llc'.
> + *
> + * This is mostly a debug hack to emulate systems with multiple LLCs per node
> + * on systems that do not naturally have this.
> + */
> +static unsigned int split_llc = 0;
> +
> +static int __init split_llc_setup(char *str)
> +{
> +	get_option(&str, &split_llc);
> +	return 0;
> +}
> +early_param("split_llc", split_llc_setup);
> +
>  static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
>  {
>  	const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
> @@ -438,6 +453,11 @@ static bool match_llc(struct cpuinfo_x86
>  	if (per_cpu_llc_id(cpu1) != per_cpu_llc_id(cpu2))
>  		return false;
>  
> +	if (split_llc &&
> +	    (per_cpu_core_id(cpu1) % split_llc) !=
> +	    (per_cpu_core_id(cpu2) % split_llc))
> +		return false;
> +
>  	/*
>  	 * Allow the SNC topology without warning. Return of false
>  	 * means 'c' does not share the LLC of 'o'. This will be

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2026-03-24 12:11       ` Peter Zijlstra
@ 2026-03-27 10:17         ` Deng, Pan
  2026-04-02 10:37           ` Deng, Pan
  2026-04-02 10:43           ` Peter Zijlstra
  0 siblings, 2 replies; 41+ messages in thread
From: Deng, Pan @ 2026-03-27 10:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo@kernel.org, rostedt@goodmis.org,
	linux-kernel@vger.kernel.org, Li, Tianyou,
	tim.c.chen@linux.intel.com, Chen, Yu C

> 
> On Tue, Mar 24, 2026 at 09:36:14AM +0000, Deng, Pan wrote:
> 
> > Regarding this patch, yes, using cacheline aligned could increase potential
> > memory usage.
> > After internal discussion, we are thinking of an alternative method to
> > mitigate the waste of memory usage, that is, using kmalloc() to allocate
> > count in a different memory space rather than placing the count and
> > cpumask together in this structure. The rationale is that, writing to
> > address pointed by the counter and reading the address from cpumask
> > is isolated in different memory space which could reduce the ratio of
> > cache false sharing, besides, kmalloc() based on slub/slab could place
> > the objects in different cache lines to reduce the cache contention.
> > The drawback of dynamic allocation counter is that, we have to maintain
> > the life cycle of the counters.
> > Could you please advise if sticking with current cache_align attribute
> > method or using kmalloc() is preferred?
> 
> Well, you'd have to allocate a full cacheline anyway. If you allocate N
> 4-byte (counter) objects, there's a fair chance they end up in the same
> cacheline (it's a SLAB after all) and then you're back to having a ton of
> false sharing.
> 
> Anyway, for your specific workload, why isn't partitioning a viable
> solution? It would not need any kernel modifications and would get rid
> of the contention entirely.

Thank you very much for pointing this out.

We understand cpuset partitioning would eliminate the contention.
However, in managed container platforms (e.g., Kubernetes), users can
obtain RT capabilities for their workloads via CAP_SYS_NICE, but they
don't have host-level privileges to create cpuset partitions.

Besides the cache line alignment approach, and considering both the
contention and the memory overhead, would it be possible to consider the
following alternative:

1. Use {counts[], masks[]} instead of vec[{count, mask}]

2. Separate counts[0] (CPUPRI_NORMAL), which experiences both heavy
   write and read traffic.
   Writes: RT task lifecycle operations (enqueue on empty runqueue,
   dequeue from non-overloaded runqueue) frequently update the
   normal priority count.
   Reads: RT tasks searching for available CPUs scan from low to high
   priority, with counts[0] being checked at the start of every
   search iteration.
   So even if workloads used lower RT priorities (e.g., RT prio 49
   instead of 99), counts[0] contention would still be heavy; it is not
   specific to the prio-99 workload configuration.

3. Separate masks from counts to ensure no contention between them.

With this change, the struct cpupri size is reduced from 26 cache lines
to 21 cache lines, which also saves memory in cpuset partitioning scenarios.

The code change looks like this:
---
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 42c40cfdf836..1e333e6edb1e 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -64,16 +64,38 @@ static int convert_prio(int prio)
 	return cpupri;
 }
 
+/*
+ * Get pointer to count for given priority index.
+ *
+ * Skip padding after counts[0] for idx > 0 to access the correct location.
+ */
+static inline atomic_t *cpupri_count(struct cpupri *cp, int idx)
+{
+	if (idx > 0)
+		idx += CPUPRI_COUNT0_PADDING;
+
+	return &cp->pri_to_cpu.counts[idx];
+}
+
+/*
+ * Get pointer to mask for given priority index.
+ */
+static inline cpumask_var_t cpupri_mask(struct cpupri *cp, int idx)
+{
+	return cp->pri_to_cpu.masks[idx];
+}
+
 static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 				struct cpumask *lowest_mask, int idx)
 {
-	struct cpupri_vec *vec  = &cp->pri_to_cpu[idx];
+	cpumask_var_t cpu_mask = cpupri_mask(cp, idx);
+
 	int skip = 0;
 
-	if (!atomic_read(&(vec)->count))
+	if (!atomic_read(cpupri_count(cp, idx)))
 		skip = 1;
 	/*
-	 * When looking at the vector, we need to read the counter,
+	 * When looking at the vector, we need to read the count,
 	 * do a memory barrier, then read the mask.
 	 *
 	 * Note: This is still all racy, but we can deal with it.
@@ -96,18 +118,18 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 	if (skip)
 		return 0;
 
-	if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+	if (cpumask_any_and(&p->cpus_mask, cpu_mask) >= nr_cpu_ids)
 		return 0;
 
 	if (lowest_mask) {
-		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+		cpumask_and(lowest_mask, &p->cpus_mask, cpu_mask);
 		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
 
 		/*
 		 * We have to ensure that we have at least one bit
 		 * still set in the array, since the map could have
 		 * been concurrently emptied between the first and
-		 * second reads of vec->mask.  If we hit this
+		 * second reads of cpu_mask.  If we hit this
 		 * condition, simply act as though we never hit this
 		 * priority level and continue on.
 		 */
@@ -227,23 +249,19 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 	 * cpu being missed by the priority loop in cpupri_find.
 	 */
 	if (likely(newpri != CPUPRI_INVALID)) {
-		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
-
-		cpumask_set_cpu(cpu, vec->mask);
+		cpumask_set_cpu(cpu, cpupri_mask(cp, newpri));
 		/*
 		 * When adding a new vector, we update the mask first,
 		 * do a write memory barrier, and then update the count, to
 		 * make sure the vector is visible when count is set.
 		 */
 		smp_mb__before_atomic();
-		atomic_inc(&(vec)->count);
+		atomic_inc(cpupri_count(cp, newpri));
 		do_mb = 1;
 	}
 	if (likely(oldpri != CPUPRI_INVALID)) {
-		struct cpupri_vec *vec  = &cp->pri_to_cpu[oldpri];
-
 		/*
-		 * Because the order of modification of the vec->count
+		 * Because the order of modification of the cpu count
 		 * is important, we must make sure that the update
 		 * of the new prio is seen before we decrement the
 		 * old prio. This makes sure that the loop sees
@@ -252,18 +270,18 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 		 * priority, as that will trigger an rt pull anyway.
 		 *
 		 * We only need to do a memory barrier if we updated
-		 * the new priority vec.
+		 * cpu count or mask of the new priority.
 		 */
 		if (do_mb)
 			smp_mb__after_atomic();
 
 		/*
-		 * When removing from the vector, we decrement the counter first
+		 * When removing from the vector, we decrement the count first
 		 * do a memory barrier and then clear the mask.
 		 */
-		atomic_dec(&(vec)->count);
+		atomic_dec(cpupri_count(cp, oldpri));
 		smp_mb__after_atomic();
-		cpumask_clear_cpu(cpu, vec->mask);
+		cpumask_clear_cpu(cpu, cpupri_mask(cp, oldpri));
 	}
 
 	*currpri = newpri;
@@ -280,10 +298,8 @@ int cpupri_init(struct cpupri *cp)
 	int i;
 
 	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
-		struct cpupri_vec *vec = &cp->pri_to_cpu[i];
-
-		atomic_set(&vec->count, 0);
-		if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
+		atomic_set(cpupri_count(cp, i), 0);
+		if (!zalloc_cpumask_var(&cp->pri_to_cpu.masks[i], GFP_KERNEL))
 			goto cleanup;
 	}
 
@@ -298,7 +314,7 @@ int cpupri_init(struct cpupri *cp)
 
 cleanup:
 	for (i--; i >= 0; i--)
-		free_cpumask_var(cp->pri_to_cpu[i].mask);
+		free_cpumask_var(cpupri_mask(cp, i));
 	return -ENOMEM;
 }
 
@@ -312,5 +328,5 @@ void cpupri_cleanup(struct cpupri *cp)
 
 	kfree(cp->cpu_to_pri);
 	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++)
-		free_cpumask_var(cp->pri_to_cpu[i].mask);
+		free_cpumask_var(cpupri_mask(cp, i));
 }
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..9041e2ffb3f3 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -1,5 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
+#include <linux/cache.h>
+
 #define CPUPRI_NR_PRIORITIES	(MAX_RT_PRIO+1)
 
 #define CPUPRI_INVALID		-1
@@ -7,13 +9,73 @@
 /* values 1-99 are for RT1-RT99 priorities */
 #define CPUPRI_HIGHER		100
 
+/*
+ * Padding to isolate counts[0] (CPUPRI_NORMAL) into its own cacheline.
+ *
+ * On 64-byte cacheline systems: (64 / 4) - 1 = 15 padding slots
+ * This places counts[0] alone in cacheline 0, counts[1..N] in subsequent lines.
+ */
+#define CPUPRI_COUNT0_PADDING	((SMP_CACHE_BYTES / sizeof(atomic_t)) - 1)
+
+/* Total count array size including padding after counts[0] */
+#define CPUPRI_COUNT_ARRAY_SIZE	(CPUPRI_NR_PRIORITIES + CPUPRI_COUNT0_PADDING)
+
+/*
+ * Padding bytes to align mask vector to cacheline boundary.
+ * Ensures no false sharing between counts[] and masks[].
+ */
+#define CPUPRI_VEC_PADDING \
+	(SMP_CACHE_BYTES - \
+	 (CPUPRI_COUNT_ARRAY_SIZE * sizeof(atomic_t) % SMP_CACHE_BYTES))
+
 struct cpupri_vec {
-	atomic_t		count;
-	cpumask_var_t		mask;
+	/*
+	 * Count vector with strategic padding to prevent false sharing.
+	 *
+	 * Layout (64-byte cachelines):
+	 *   Cacheline 0: counts[0] (CPUPRI_NORMAL) + 60 bytes padding
+	 *   Cacheline 1+: counts[1..100] (RT priorities 1-99, CPUPRI_HIGHER)
+	 *
+	 * counts[0] experiences the heaviest read and write traffic:
+	 * - Write: RT task lifecycle operations (enqueue on empty runqueue,
+	 *   dequeue from a non-overloaded runqueue) frequently update the
+	 *   normal priority count.
+	 * - Read: RT tasks searching for available CPUs scan from low to high
+	 *   priority, with counts[0] being checked at the start of every
+	 *   search iteration.
+	 * Isolating counts[0] in its own cacheline prevents contention with other
+	 * priority counts during concurrent search and update operations.
+	 */
+	atomic_t		counts[CPUPRI_COUNT_ARRAY_SIZE];
+
+	/*
+	 * Padding to separate count and mask vectors.
+	 *
+	 * Prevents false sharing between:
+	 * - counts[] (read-write, hot path in cpupri_set)
+	 * - masks[] (read-mostly, accessed in cpupri_find)
+	 */
+	char			padding[CPUPRI_VEC_PADDING];
+
+	/*
+	 * CPU mask vector.
+	 *
+	 * Either stores:
+	 * - Pointers to dynamically allocated cpumasks (read-mostly after init)
+	 * - Inline cpumasks (if !CPUMASK_OFFSTACK)
+	 */
+	cpumask_var_t		masks[CPUPRI_NR_PRIORITIES];
 };
 
 struct cpupri {
-	struct cpupri_vec	pri_to_cpu[CPUPRI_NR_PRIORITIES];
+	/*
+	 * Priority-to-CPU mapping.
+	 *
+	 * Single cpupri_vec structure containing all counts and masks,
+	 * rather than 101 separate cpupri_vec elements. This reduces
+	 * memory overhead from ~26 to ~21 cachelines.
+	 */
+	struct cpupri_vec       pri_to_cpu;
 	int			*cpu_to_pri;
 };
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..2263237cdeb0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1014,7 +1014,7 @@ struct root_domain {
 	 * one runnable RT task.
 	 */
 	cpumask_var_t		rto_mask;
-	struct cpupri		cpupri;
+	struct cpupri		cpupri	____cacheline_aligned;
 
 	/*
 	 * NULL-terminated list of performance domains intersecting with the

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-03-24 12:00       ` Peter Zijlstra
@ 2026-03-31  5:37         ` Chen, Yu C
  2026-03-31 10:19           ` K Prateek Nayak
  0 siblings, 1 reply; 41+ messages in thread
From: Chen, Yu C @ 2026-03-31  5:37 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li, kprateek.nayak

On 3/24/2026 8:00 PM, Peter Zijlstra wrote:
> On Mon, Mar 23, 2026 at 11:45:01AM -0700, Tim Chen wrote:
>> On Fri, 2026-03-20 at 13:40 +0100, Peter Zijlstra wrote:
>>> On Mon, Jul 21, 2025 at 02:10:26PM +0800, Pan Deng wrote:
>>>
>>>> This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
>>>> mitigate false sharing.
>>>
>>> So I really do think we need something here. We're running into the
>>> whole cpumask contention thing on a semi regular basis.
>>>

[ ... ]

> +
> +unsigned int sbm_find_next_bit(struct sbm *sbm, int start)
> +{
> +	struct sbm_leaf *leaf = (void *)sbm;
> +	struct sbm_root *root = (void *)sbm;
> +	int nr = start >> arch_sbm_shift;
> +	int bit = start & arch_sbm_mask;
> +	unsigned long tmp, mask = (~0UL) << bit;
> +	if (sbm->type == st_root) {
> +		for (; nr < arch_sbm_leafs; nr++, mask = ~0UL) {
> +			leaf = root->leafs[nr];
> +			tmp = leaf->bitmap & mask;
> +			if (!tmp)
> +				continue;

I suppose this should be
	if (tmp)
		break;
otherwise
[   40.071616] watchdog: BUG: soft lockup - CPU#0 stuck for 30s! 
[swapper/0:0]
[   40.071616] Modules linked in:
[   40.071616] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 
7.0.0-rc5-sbm-+ #16 PREEMPT(full)
[   40.071616] RIP: 0010:sbm_find_next_bit+0x2a/0xa0

> +		}
> +	} else {
> +		tmp = leaf->bitmap & mask;
> +	}
> +	if (!tmp)
> +		return -1;
> +	return (nr << arch_sbm_shift) | __ffs(tmp);
> +}

Update on the test:
With the above change, I did a simple hackbench test on a system with
multiple LLCs within 1 node. The benefit is significant (+12% to +30%)
when the system is under-loaded, while there is some regression (-10%)
when it is overloaded (still need to figure that out).

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-03-31  5:37         ` Chen, Yu C
@ 2026-03-31 10:19           ` K Prateek Nayak
  2026-04-02  3:15             ` Chen, Yu C
  0 siblings, 1 reply; 41+ messages in thread
From: K Prateek Nayak @ 2026-03-31 10:19 UTC (permalink / raw)
  To: Chen, Yu C, Peter Zijlstra, Tim Chen
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Chenyu,

On 3/31/2026 11:07 AM, Chen, Yu C wrote:
> Update on the test:
> With the above change, I did a simple hackbench test on a system with
> multiple LLCs within 1 node. The benefit is significant (+12% to +30%)
> when the system is under-loaded, while there is some regression (-10%)
> when it is overloaded (still need to figure that out).

Could it be because of how we are traversing the CPUs now for idle load
balancing? Since we use the first set bit for ilb_cpu and also start
balancing from that very CPU, we might just stop after a successful
balance on the ilb_cpu.

Would something like below on top of Peter's suggestion + your fix help?

  (lightly tested; Has survived sched messaging on baremetal)

diff --git a/include/linux/sbm.h b/include/linux/sbm.h
index 8beade6c0585..98c4c1866534 100644
--- a/include/linux/sbm.h
+++ b/include/linux/sbm.h
@@ -76,8 +76,45 @@ static inline bool sbm_cpu_test(struct sbm *sbm, int cpu)
 	return __sbm_op(sbm, test_bit);
 }
 
+static __always_inline
+unsigned int sbm_find_next_bit_wrap(struct sbm *sbm, int start)
+{
+	int bit = sbm_find_next_bit(sbm, start);
+
+	if (bit >= 0 || start == 0)
+		return bit;
+
+	bit = sbm_find_next_bit(sbm, 0);
+	return bit < start ? bit : -1;
+}
+
+static __always_inline
+unsigned int __sbm_for_each_wrap(struct sbm *sbm, int start, int n)
+{
+	int bit;
+
+	/* If not wrapped around */
+	if (n > start) {
+		/* and have a bit, just return it. */
+		bit = sbm_find_next_bit(sbm, n);
+		if (bit >= 0)
+			return bit;
+
+		/* Otherwise, wrap around and ... */
+		n = 0;
+	}
+
+	/* Search the other part. */
+	bit = sbm_find_next_bit(sbm, n);
+	return bit < start ? bit : -1;
+}
+
 #define sbm_for_each_set_bit(sbm, idx) \
 	for (int idx = sbm_find_next_bit(sbm, 0); \
 	     idx >= 0; idx = sbm_find_next_bit(sbm, idx+1))
 
+#define sbm_for_each_set_bit_wrap(sbm, idx, start) \
+	for (int idx = sbm_find_next_bit_wrap(sbm, start); \
+	     idx >= 0; idx = __sbm_for_each_wrap(sbm, start, idx+1))
+
 #endif /* _LINUX_SBM_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3a423c4706e..f485afb6286d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12916,6 +12916,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 	int this_cpu = this_rq->cpu;
 	int balance_cpu;
 	struct rq *rq;
+	u32 start;
 
 	WARN_ON_ONCE((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
 
@@ -12944,7 +12945,8 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 	 * Start with the next CPU after this_cpu so we will end with this_cpu and let a
 	 * chance for other idle cpu to pull load.
 	 */
-	sbm_for_each_set_bit(nohz.sbm, idx) {
+	start = arch_sbm_cpu_to_idx((this_cpu + 1) % nr_cpu_ids);
+	sbm_for_each_set_bit_wrap(nohz.sbm, idx, start) {
 		balance_cpu = arch_sbm_idx_to_cpu(idx);
 
 		if (!idle_cpu(balance_cpu))
---

This is pretty much giving me similar performance to tip for sched
messaging runs under heavy load but your mileage may vary :-)

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-03-31 10:19           ` K Prateek Nayak
@ 2026-04-02  3:15             ` Chen, Yu C
  2026-04-02  4:41               ` K Prateek Nayak
  0 siblings, 1 reply; 41+ messages in thread
From: Chen, Yu C @ 2026-04-02  3:15 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Tim Chen
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Prateek,

On 3/31/2026 6:19 PM, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 3/31/2026 11:07 AM, Chen, Yu C wrote:
>> Update on the test:
>> With the above change, I did a simple hackbench test on a system with
>> multiple LLCs within 1 node. The benefit is significant (+12% to +30%)
>> when the system is under-loaded, while there is some regression (-10%)
>> when it is overloaded (still need to figure that out).
> 
> Could it be because of how we are traversing the CPUs now for idle load
> balancing? Since we use the first set bit for ilb_cpu and also start
> balancing from that very CPU, we might just stop after a successful
> balance on the ilb_cpu.
> 
> Would something like below on top of Peter's suggestion + your fix help?
> 
>    (lightly tested; Has survived sched messaging on baremetal)
> 
> diff --git a/include/linux/sbm.h b/include/linux/sbm.h
> index 8beade6c0585..98c4c1866534 100644
> --- a/include/linux/sbm.h
> +++ b/include/linux/sbm.h
> @@ -76,8 +76,45 @@ static inline bool sbm_cpu_test(struct sbm *sbm, int cpu)
>   	return __sbm_op(sbm, test_bit);
>   }
>   
> +static __always_inline
> +unsigned int sbm_find_next_bit_wrap(struct sbm *sbm, int start)
> +{
> +	int bit = sbm_find_next_bit(sbm, start);
> +
> +	if (bit >= 0 || start == 0)
> +		return bit;
> +
> +	bit = sbm_find_next_bit(sbm, 0);
> +	return bit < start ? bit : -1;
> +}
> +
> +static __always_inline
> +unsigned int __sbm_for_each_wrap(struct sbm *sbm, int start, int n)
> +{
> +	int bit;
> +
> +	/* If not wrapped around */
> +	if (n > start) {
> +		/* and have a bit, just return it. */
> +		bit = sbm_find_next_bit(sbm, n);
> +		if (bit >= 0)
> +			return bit;
> +
> +		/* Otherwise, wrap around and ... */
> +		n = 0;
> +	}
> +
> +	/* Search the other part. */
> +	bit = sbm_find_next_bit(sbm, n);
> +	return bit < start ? bit : -1;
> +}
> +
>   #define sbm_for_each_set_bit(sbm, idx) \
>   	for (int idx = sbm_find_next_bit(sbm, 0); \
>   	     idx >= 0; idx = sbm_find_next_bit(sbm, idx+1))
>   
> +#define sbm_for_each_set_bit_wrap(sbm, idx, start) \
> +	for (int idx = sbm_find_next_bit_wrap(sbm, start); \
> +	     idx >= 0; idx = __sbm_for_each_wrap(sbm, start, idx+1))
> +
>   #endif /* _LINUX_SBM_H */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a3a423c4706e..f485afb6286d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12916,6 +12916,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
>   	int this_cpu = this_rq->cpu;
>   	int balance_cpu;
>   	struct rq *rq;
> +	u32 start;
>   
>   	WARN_ON_ONCE((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>   
> @@ -12944,7 +12945,8 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
>   	 * Start with the next CPU after this_cpu so we will end with this_cpu and let a
>   	 * chance for other idle cpu to pull load.
>   	 */
> -	sbm_for_each_set_bit(nohz.sbm, idx) {
> +	start = arch_sbm_cpu_to_idx((this_cpu + 1) % nr_cpu_ids);
> +	sbm_for_each_set_bit_wrap(nohz.sbm, idx, start) {
>   		balance_cpu = arch_sbm_idx_to_cpu(idx);
>   
>   		if (!idle_cpu(balance_cpu))
> ---
> 
> This is pretty much giving me similar performance as tip for sched
> messaging runs under heavy load but your mileage may vary :-)
> 

Thanks very much for providing this optimization. It should help
more nohz idle CPUs - beyond just the currently selected ilb_cpu -
assist in offloading work. When I applied this patch and reran
the test, it appeared to introduce some regressions (underloaded and
overloaded) compared to the baseline without Peter's sbm applied.

One suspicion is that with sbm enabled (without your patch), more
tasks are "aggregated" onto the first CPU (or maybe the front part)
of nohz.sbm, because sbm_for_each_set_bit() always picks the first
idle CPU to pull work. As we already know, hackbench on our
platform strongly prefers being aggregated rather than being
spread across different LLCs. So with the spreading fix, the
hackbench tasks might be placed on different CPUs. Anyway, I'll run more
rounds of testing to check whether this is consistent or merely
due to run-to-run variance. And I'll try other workloads besides
hackbench. Or do you have a suggestion for a workload we can try
that is sensitive to nohz cpumask access? (I chose hackbench because
I found Shrikanth was using hackbench for nohz evaluation in
commit 5d86d542f6.)

thanks,
Chenyu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-02  3:15             ` Chen, Yu C
@ 2026-04-02  4:41               ` K Prateek Nayak
  2026-04-02 10:55                 ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: K Prateek Nayak @ 2026-04-02  4:41 UTC (permalink / raw)
  To: Chen, Yu C, Peter Zijlstra, Tim Chen
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Chenyu,

Thank you for testing the changes! Much appreciated.

On 4/2/2026 8:45 AM, Chen, Yu C wrote:
> One suspicion is that with sbm enabled (without your patch), more
> tasks are "aggregated" onto the first CPU (or maybe the front part)
> of nohz.sbm, because sbm_for_each_set_bit() always picks the first
> idle CPU to pull work. As we already know, hackbench on our
> platform strongly prefers being aggregated rather than being
> spread across different LLCs. So with the spreading fix, the
> hackbench tasks might be placed on different CPUs.

Ack! But I cannot seem to come up with a theory on why it would be any
worse than the original.

P.S. what does your SBM log in the dmesg look like? On my 3rd Generation
EPYC machine (2 x 64C/128T) it looks like:

    CPU topo: SBM: shift(6) leafs(4) APIC(ff)

Now, I suppose I get 4 leaves because I have 128 CPUs per socket
(2 x u64 per socket) but it is not super clear how it is achieved from
doing:

    arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;

that divides the TOPO_DIE_DOMAIN range into two, but that should only be
okay up to 128 CPUs per DIE.

It is still not super clear to me how the logic deals with more than
128 CPUs in a DIE domain, because that'll need more than the u64, but
sbm_find_next_bit() simply does:

    tmp = leaf->bitmap & mask; /* All are u64 */

expecting just the u64 bitmap to represent all the CPUs in the leaf.

If we have, say, 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
as 7f (127), which allows a leaf to cover more than 64 CPUs, but we are
using the "u64 bitmap" directly and not:

    find_next_bit(bitmap, arch_sbm_mask)

Am I missing something here?
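
To illustrate what I was expecting - a rough sketch, with made-up names
(leaf_nr, start_idx) and assuming leaf->bitmap were widened to an
unsigned long array sized to cover the whole leaf:

    unsigned int leaf_bits = 1U << arch_sbm_shift;
    unsigned int bit;

    /* generic scan within one leaf, not limited to a single u64 */
    bit = find_next_bit(leaf->bitmap, leaf_bits, start_idx & arch_sbm_mask);
    if (bit < leaf_bits)
            return (leaf_nr << arch_sbm_shift) + bit;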

AMD got the 0x80000026 leaf for defining TOPO_DIE_DOMAIN as soon as we
crossed 256 CPUs per socket in 4th Generation EPYC, so it'll have per-CCD
(up to 2 LLCs) sbm leaves, but if I'm not mistaken, some of the SPR
systems still advertised one large TILE / DIE domain.

I'm curious if your test system exposed multiple DIEs per PKG, since
240 logical CPUs per socket, going by the cover letter, would still need
more than 64 bits if it is advertised as a single DIE.

> Anyway, I'll run more
> rounds of testing to check whether this is consistent or merely
> due to run-to-run variance. And I'll try other workloads besides
> hackbench. Or do you have suggestion on what workload we can try,
> which is sensitive to nohz cpumask access(I chose hackbench because
> I found Shrikanth was using hackbench for nohz evaluation in
> commit 5d86d542f6)

Most sensitive is schbench's tail latency when the system is fully
loaded (#workers = #CPUs), but that data point also has large run-to-run
variation - I generally look for crazy jumps, like the tail latency
turning 5-8x consistently across multiple runs, before actually
concluding it is a regression.

hackbench (/ sched-messaging) should be good enough from a
throughput standpoint.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2026-03-27 10:17         ` Deng, Pan
@ 2026-04-02 10:37           ` Deng, Pan
  2026-04-02 10:43           ` Peter Zijlstra
  1 sibling, 0 replies; 41+ messages in thread
From: Deng, Pan @ 2026-04-02 10:37 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt
  Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, Li, Tianyou,
	tim.c.chen@linux.intel.com, Chen, Yu C

> +	atomic_t		counts[CPUPRI_COUNT_ARRAY_SIZE];
> +
> +	/*
> +	 * Padding to separate count and mask vectors.
> +	 *
> +	 * Prevents false sharing between:
> +	 * - counts[] (read-write, hot path in cpupri_set)
> +	 * - masks[] (read-mostly, accessed in cpupri_find)
> +	 */
> +	char			padding[CPUPRI_VEC_PADDING];
> +
> +	/*
> +	 * CPU mask vector.
> +	 *
> +	 * Either stores:
> +	 * - Pointers to dynamically allocated cpumasks (read-mostly after init)
> +	 * - Inline cpumasks (if !CPUMASK_OFFSTACK)
> +	 */
> +	cpumask_var_t		masks[CPUPRI_NR_PRIORITIES];
>  };
> 
>  struct cpupri {
> -	struct cpupri_vec	pri_to_cpu[CPUPRI_NR_PRIORITIES];
> +	/*
> +	 * Priority-to-CPU mapping.
> +	 *
> +	 * Single cpupri_vec structure containing all counts and masks,
> +	 * rather than 101 separate cpupri_vec elements. This reduces
> +	 * memory overhead from ~26 to ~21 cachelines.
> +	 */
> +	struct cpupri_vec       pri_to_cpu;
>  	int			*cpu_to_pri;
>  };
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 475bb5998295..2263237cdeb0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1014,7 +1014,7 @@ struct root_domain {
>  	 * one runnable RT task.
>  	 */
>  	cpumask_var_t		rto_mask;
> -	struct cpupri		cpupri;
> +	struct cpupri		cpupri	____cacheline_aligned;
> 
>  	/*
>  	 * NULL-terminated list of performance domains intersecting with the

Peter and Steven,

Here we consider two approaches:
- The cache-line alignment approach is simple to implement but increases
  memory usage.
- The alternative approach (separating counts and masks, with padding
  after counts[0]) reduces the memory footprint at the cost of slightly
  higher complexity.

What is your opinion? Thanks a lot!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2026-03-27 10:17         ` Deng, Pan
  2026-04-02 10:37           ` Deng, Pan
@ 2026-04-02 10:43           ` Peter Zijlstra
  1 sibling, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2026-04-02 10:43 UTC (permalink / raw)
  To: Deng, Pan
  Cc: mingo@kernel.org, rostedt@goodmis.org,
	linux-kernel@vger.kernel.org, Li, Tianyou,
	tim.c.chen@linux.intel.com, Chen, Yu C

On Fri, Mar 27, 2026 at 10:17:13AM +0000, Deng, Pan wrote:
> > 
> > On Tue, Mar 24, 2026 at 09:36:14AM +0000, Deng, Pan wrote:
> > 
> > > Regarding this patch, yes, using cacheline aligned could increase potential
> > > memory usage.
> > > After internal discussion, we are thinking of an alternative method to
> > > mitigate the waste of memory usage, that is, using kmalloc() to allocate
> > > count in a different memory space rather than placing the count and
> > > cpumask together in this structure. The rationale is that, writing to
> > > address pointed by the counter and reading the address from cpumask
> > > is isolated in different memory space which could reduce the ratio of
> > > cache false sharing, besides, kmalloc() based on slub/slab could place
> > > the objects in different cache lines to reduce the cache contention.
> > > The drawback of dynamic allocation counter is that, we have to maintain
> > > the life cycle of the counters.
> > > Could you please advise if sticking with current cache_align attribute
> > > method or using kmalloc() is preferred?
> > 
> > Well, you'd have to allocate a full cacheline anyway. If you allocate N
> > 4 byte (counter) objects, there's a fair chance they end up in the same
> > cacheline (its a SLAB after all) and then you're back to having a ton of
> > false sharing.
> > 
> > Anyway, for you specific workload, why isn't partitioning a viable
> > solution? It would not need any kernel modifications and would get rid
> > of the contention entirely.
> 
> Thank you very much for pointing this out.
> 
> We understand cpuset partitioning would eliminate the contention.
> However, in managed container platforms (e.g., Kubernetes), users can
> obtain RT capabilities for their workloads via CAP_SYS_NICE, but they
> don't have host-level privileges to create cpuset partitions.

So because Kubernetes is shit, you're going to patch the kernel? Isn't
that backwards? Should you not instead try and fix this kubernetes
thing?


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-02  4:41               ` K Prateek Nayak
@ 2026-04-02 10:55                 ` Peter Zijlstra
  2026-04-02 11:06                   ` K Prateek Nayak
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-04-02 10:55 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Chen, Yu C, Tim Chen, Pan Deng, mingo, linux-kernel, tianyou.li

On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:

> It is still not super clear to me how the logic deals with more than
> 128CPUs in a DIE domain because that'll need more than the u64 but
> sbm_find_next_bit() simply does:
> 
>     tmp = leaf->bitmap & mask; /* All are u64 */
> 
> expecting just the u64 bitmap to represent all the CPUs in the leaf.
> 
> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
> as 7f (127) which allows a leaf to more than 64 CPUs but we are
> using the "u64 bitmap" directly and not:
> 
>     find_next_bit(bitmap, arch_sbm_mask)
> 
> Am I missing something here?

Nope. That logic just isn't there, that was left as an exercise to the
reader :-)

For AMD in particular it would be good to have one leaf per CCD, but
since CCDs are not enumerated in your topology (they really should be),
I didn't do that.

Now, I seem to remember we had this discussion some time in the past,
and you had some hacks available.

Anyway, the whole premise was to have one leaf/cacheline per cache, such
that high-frequency atomic set/clear bit operations don't bounce the
line around.
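
Roughly, the shape is something like this (an illustrative sketch only;
NR_SBM_LEAFS and the field names here are approximations, not the exact
layout in the patch):

    struct sbm_leaf {
            unsigned long   bitmap;         /* CPUs of one cache domain */
    } ____cacheline_aligned;

    struct sbm {
            struct sbm_leaf leafs[NR_SBM_LEAFS];
    };

Each set_bit()/clear_bit() then only dirties the leaf belonging to the
caller's cache domain, and a full scan walks the leafs array.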

I took the nohz bitmap, because it was relatively simple and is known to
suffer from contention under certain workloads.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-02 10:55                 ` Peter Zijlstra
@ 2026-04-02 11:06                   ` K Prateek Nayak
  2026-04-03  5:46                     ` Chen, Yu C
  0 siblings, 1 reply; 41+ messages in thread
From: K Prateek Nayak @ 2026-04-02 11:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chen, Yu C, Tim Chen, Pan Deng, mingo, linux-kernel, tianyou.li

Hello Peter,

On 4/2/2026 4:25 PM, Peter Zijlstra wrote:
> On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:
> 
>> It is still not super clear to me how the logic deals with more than
>> 128CPUs in a DIE domain because that'll need more than the u64 but
>> sbm_find_next_bit() simply does:
>>
>>     tmp = leaf->bitmap & mask; /* All are u64 */
>>
>> expecting just the u64 bitmap to represent all the CPUs in the leaf.
>>
>> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
>> as 7f (127) which allows a leaf to more than 64 CPUs but we are
>> using the "u64 bitmap" directly and not:
>>
>>     find_next_bit(bitmap, arch_sbm_mask)
>>
>> Am I missing something here?
> 
> Nope. That logic just isn't there, that was left as an exercise to the
> reader :-)

Ack! Let me go fiddle with that.

> 
> For AMD in particular it would be good to have one leaf per CCD, but
> since CCD are not enumerated in your topology (they really should be), I
> didn't do that.

We got the extended topology leaf 0x80000026 starting with 4th Generation
EPYC, and we (well, Thomas) added the parser support in v6.10 [1], so we
can discover the CCD boundary using that now ;-)

[1] https://lore.kernel.org/all/20240314050432.1710-1-kprateek.nayak@amd.com/

> 
> Now, I seem to remember we had this discussion in the past some time,
> and you had some hacks available.

That, I believe, was for the NPS boundaries that we don't expose in NPS1
but CCX should be good enough.

> 
> Anyway, the whole premise was to have one leaf/cacheline per cache, such
> that high frequency atomic ops set/clear bit, don't bounce the line
> around.
> 
> I took the nohz bitmap, because it was relatively simple and is known to
> suffer from contention under certain workloads.

Ack! It would be better to tie it to TOPO_TILE_DOMAIN then, which maps
to the "CCX" on AMD and is the LLC boundary. A CCD is just a cluster of
nearby CCXs - mostly the dense-core offerings enumerate 2 CCXs per CCD.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-02 11:06                   ` K Prateek Nayak
@ 2026-04-03  5:46                     ` Chen, Yu C
  2026-04-03  8:13                       ` K Prateek Nayak
  2026-04-07 20:35                       ` Tim Chen
  0 siblings, 2 replies; 41+ messages in thread
From: Chen, Yu C @ 2026-04-03  5:46 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra
  Cc: Tim Chen, Pan Deng, mingo, linux-kernel, tianyou.li

On 4/2/2026 7:06 PM, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 4/2/2026 4:25 PM, Peter Zijlstra wrote:
>> On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:
>>
>>> It is still not super clear to me how the logic deals with more than
>>> 128CPUs in a DIE domain because that'll need more than the u64 but
>>> sbm_find_next_bit() simply does:
>>>
>>>      tmp = leaf->bitmap & mask; /* All are u64 */
>>>
>>> expecting just the u64 bitmap to represent all the CPUs in the leaf.
>>>
>>> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
>>> as 7f (127) which allows a leaf to more than 64 CPUs but we are
>>> using the "u64 bitmap" directly and not:
>>>
>>>      find_next_bit(bitmap, arch_sbm_mask)
>>>
>>> Am I missing something here?
>>
>> Nope. That logic just isn't there, that was left as an exercise to the
>> reader :-)
> 
> Ack! Let me go fiddle with that.
> 

Nice catch. I hadn't noticed this since we have fewer than
64 CPUs per die. Please feel free to send patches to me when
they're available.

And regarding your other question about the calculation of arch_sbm_shift,
I'm trying to understand why there is a subtraction of 1. Should it be:

-       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
+       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1];

? Are we trying to filter out the raw globally unique die id - similar to
topo_apicid(), which masks off the lower x86_topo_system.dom_shifts[dom - 1]
bits?

With the above change I get the correct number of leaves (4) rather than
(2) as in the original version.

thanks,
Chenyu




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-03  5:46                     ` Chen, Yu C
@ 2026-04-03  8:13                       ` K Prateek Nayak
  2026-04-07 20:35                       ` Tim Chen
  1 sibling, 0 replies; 41+ messages in thread
From: K Prateek Nayak @ 2026-04-03  8:13 UTC (permalink / raw)
  To: Chen, Yu C, Peter Zijlstra
  Cc: Tim Chen, Pan Deng, mingo, linux-kernel, tianyou.li

Hello Chenyu,

On 4/3/2026 11:16 AM, Chen, Yu C wrote:
> On 4/2/2026 7:06 PM, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> On 4/2/2026 4:25 PM, Peter Zijlstra wrote:
>>> On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:
>>>
>>>> It is still not super clear to me how the logic deals with more than
>>>> 128CPUs in a DIE domain because that'll need more than the u64 but
>>>> sbm_find_next_bit() simply does:
>>>>
>>>>      tmp = leaf->bitmap & mask; /* All are u64 */
>>>>
>>>> expecting just the u64 bitmap to represent all the CPUs in the leaf.
>>>>
>>>> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
>>>> as 7f (127) which allows a leaf to more than 64 CPUs but we are
>>>> using the "u64 bitmap" directly and not:
>>>>
>>>>      find_next_bit(bitmap, arch_sbm_mask)
>>>>
>>>> Am I missing something here?
>>>
>>> Nope. That logic just isn't there, that was left as an exercise to the
>>> reader :-)
>>
>> Ack! Let me go fiddle with that.
>>
> 
> Nice catch. I hadn't noticed this since we have fewer than
> 64 CPUs per die. Please feel free to send patches to me when
> they're available.
> 
> And regarding your other question about the calculation of arch_sbm_shift,
> I'm trying to understand why there is a subtraction of 1, should it be:
> -       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
> +       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1];
> ?
> Are we trying to filer the raw global unique die id? - similar to topo_apicid()
> which mask the lower x86_topo_system.dom_shifts[dom - 1]).
> 
> With above change I can get a correct value of leaves (4) rather than (2) in
> the original version.

Thanks for confirming. I guess that would just be TOPO_TILE_DOMAIN then
and would work well on AMD too since that is where the CCX is mapped.
I'll get hold of an SPR / use a VM to confirm the 0x1f behavior.

I'll post the patches next week since I have to check with Andrea on how
the ARM systems have decided to number their SMT threads and whether
they require separate plumbing for arch_sbm_idx_to_cpu() /
arch_sbm_cpu_to_idx() or not.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-03  5:46                     ` Chen, Yu C
  2026-04-03  8:13                       ` K Prateek Nayak
@ 2026-04-07 20:35                       ` Tim Chen
  2026-04-08  3:06                         ` K Prateek Nayak
  2026-04-08  9:25                         ` Chen, Yu C
  1 sibling, 2 replies; 41+ messages in thread
From: Tim Chen @ 2026-04-07 20:35 UTC (permalink / raw)
  To: Chen, Yu C, K Prateek Nayak, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

On Fri, 2026-04-03 at 13:46 +0800, Chen, Yu C wrote:
> On 4/2/2026 7:06 PM, K Prateek Nayak wrote:
> > Hello Peter,
> > 
> > On 4/2/2026 4:25 PM, Peter Zijlstra wrote:
> > > On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:
> > > 
> > > > It is still not super clear to me how the logic deals with more than
> > > > 128CPUs in a DIE domain because that'll need more than the u64 but
> > > > sbm_find_next_bit() simply does:
> > > > 
> > > >      tmp = leaf->bitmap & mask; /* All are u64 */
> > > > 
> > > > expecting just the u64 bitmap to represent all the CPUs in the leaf.
> > > > 
> > > > If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
> > > > as 7f (127) which allows a leaf to more than 64 CPUs but we are
> > > > using the "u64 bitmap" directly and not:
> > > > 
> > > >      find_next_bit(bitmap, arch_sbm_mask)
> > > > 
> > > > Am I missing something here?
> > > 
> > > Nope. That logic just isn't there, that was left as an exercise to the
> > > reader :-)
> > 
> > Ack! Let me go fiddle with that.
> > 
> 
> Nice catch. I hadn't noticed this since we have fewer than
> 64 CPUs per die. Please feel free to send patches to me when
> they're available.
> 
> And regarding your other question about the calculation of arch_sbm_shift,
> I'm trying to understand why there is a subtraction of 1, should it be:
> -       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
> +       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1];

Perhaps something like

	arch_sbm_shift = min(sizeof(unsigned long),
			     topology_get_domain_shift(TOPO_TILE_DOMAIN));

to take care of both AMD systems and the 64-bit leaf bitmask limit?

Tim

> ?
> Are we trying to filer the raw global unique die id? - similar to 
> topo_apicid()
> which mask the lower x86_topo_system.dom_shifts[dom - 1]).
> 
> With above change I can get a correct value of leaves (4) rather than (2) in
> the original version.
> 
> thanks,
> Chenyu
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-07 20:35                       ` Tim Chen
@ 2026-04-08  3:06                         ` K Prateek Nayak
  2026-04-08 11:35                           ` Chen, Yu C
  2026-04-08  9:25                         ` Chen, Yu C
  1 sibling, 1 reply; 41+ messages in thread
From: K Prateek Nayak @ 2026-04-08  3:06 UTC (permalink / raw)
  To: Tim Chen, Chen, Yu C, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Tim,

On 4/8/2026 2:05 AM, Tim Chen wrote:
>> And regarding your other question about the calculation of arch_sbm_shift,
>> I'm trying to understand why there is a subtraction of 1, should it be:
>> -       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
>> +       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1];
> 
> Perhaps something like
> 
> 	arch_sbm_shift = min(sizeof(unsigned long),
> 			     topology_get_domain_shift(TOPO_TILE_DOMAIN));
> 
> to take care of both AMD system and the 64 bit leaf bitmask limit?

Ack! But do we want to separate CPUs of the same LLC domain across
different cachelines in 64-CPU chunks, or should we use the rest
of the padding to represent them?

I'm collecting some performance numbers to see if it makes any
difference under high contention, but have you seen benefits of
sharding the mask further when there are hundreds of CPUs on the
same LLC?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-07 20:35                       ` Tim Chen
  2026-04-08  3:06                         ` K Prateek Nayak
@ 2026-04-08  9:25                         ` Chen, Yu C
  2026-04-08 16:47                           ` Tim Chen
  1 sibling, 1 reply; 41+ messages in thread
From: Chen, Yu C @ 2026-04-08  9:25 UTC (permalink / raw)
  To: Tim Chen
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li, K Prateek Nayak,
	Peter Zijlstra

On 4/8/2026 4:35 AM, Tim Chen wrote:
> On Fri, 2026-04-03 at 13:46 +0800, Chen, Yu C wrote:
>> On 4/2/2026 7:06 PM, K Prateek Nayak wrote:
>>> Hello Peter,
>>>
>>> On 4/2/2026 4:25 PM, Peter Zijlstra wrote:
>>>> On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:
>>>>
>>>>> It is still not super clear to me how the logic deals with more than
>>>>> 128CPUs in a DIE domain because that'll need more than the u64 but
>>>>> sbm_find_next_bit() simply does:
>>>>>
>>>>>       tmp = leaf->bitmap & mask; /* All are u64 */
>>>>>
>>>>> expecting just the u64 bitmap to represent all the CPUs in the leaf.
>>>>>
>>>>> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
>>>>> as 7f (127) which allows a leaf to more than 64 CPUs but we are
>>>>> using the "u64 bitmap" directly and not:
>>>>>
>>>>>       find_next_bit(bitmap, arch_sbm_mask)
>>>>>
>>>>> Am I missing something here?
>>>>
>>>> Nope. That logic just isn't there, that was left as an exercise to the
>>>> reader :-)
>>>
>>> Ack! Let me go fiddle with that.
>>>
>>
>> Nice catch. I hadn't noticed this since we have fewer than
>> 64 CPUs per die. Please feel free to send patches to me when
>> they're available.
>>
>> And regarding your other question about the calculation of arch_sbm_shift,
>> I'm trying to understand why there is a subtraction of 1, should it be:
>> -       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
>> +       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1];
> 
> Perhaps something like
> 
> 	arch_sbm_shift = min(sizeof(unsigned long),
> 			     topology_get_domain_shift(TOPO_TILE_DOMAIN));
> 
> to take care of both AMD system and the 64 bit leaf bitmask limit?
> 

Yes, this should be doable (Prateek has mentioned using TOPO_TILE_DOMAIN).
The only drawback I can think of is that if there are more than 64 CPUs
within a die, it is possible that CPUs in different dies (LLCs) get indexed
into the same leaf and access the same mask, which would still lead to
cache contention. Maybe we should allocate the leaf cpumask according to
the actual size of a die?

thanks,
Chenyu



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2025-07-21  6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
  2026-03-20 10:09   ` Peter Zijlstra
@ 2026-04-08 10:16   ` Chen, Yu C
  2026-04-09 11:47     ` Deng, Pan
  1 sibling, 1 reply; 41+ messages in thread
From: Chen, Yu C @ 2026-04-08 10:16 UTC (permalink / raw)
  To: Pan Deng; +Cc: linux-kernel, tianyou.li, tim.c.chen, peterz, mingo

On 7/21/2025 2:10 PM, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on an HCC system, significant
> cache line contention is observed around `cpupri_vec->count` and `mask` in
> struct root_domain.
> 
> The SUT is a 2-socket machine with 240 physical cores and 480 logical
> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS is used as score.
> 

[ ... ]

> As a result:
> - FPS improves by ~11%
> - Kernel cycles% drops from ~20% to ~11%
> - `count` and `mask` related cache line contention is mitigated, perf c2c
>    shows root_domain cache line 3 `cycles per load` drops from ~10K-59K
>    to ~0.5K-8K, cpupri's last cache line no longer appears in the report.
> - stress-ng cyclic benchmark is improved ~31.4%, command:
>    stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo   \
>                        --timeout 30 --minimize --metrics
> - rt-tests/pi_stress is improved ~76.5%, command:
>    rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
> 

According to your test results above, this original proposal seems
simple enough. It provides a general benefit, not only for FFmpeg workloads
with "unusual" CPU affinity settings, but also for other common workloads
that do not use CPU affinity or partitioning.
I still prefer this proposal. Later we can rebase patch 4 on top of sbm
to see if it brings further improvements. patch 1 and patch 4 could form a
patch series IMHO.

thanks,
Chenyu

> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..245b0fa626be 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -9,7 +9,7 @@
>   
>   struct cpupri_vec {
>   	atomic_t		count;
> -	cpumask_var_t		mask;
> +	cpumask_var_t		mask	____cacheline_aligned;
>   };
>   
>   struct cpupri {

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-08  3:06                         ` K Prateek Nayak
@ 2026-04-08 11:35                           ` Chen, Yu C
  2026-04-08 15:52                             ` K Prateek Nayak
  0 siblings, 1 reply; 41+ messages in thread
From: Chen, Yu C @ 2026-04-08 11:35 UTC (permalink / raw)
  To: K Prateek Nayak, Tim Chen, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Prateek,

On 4/8/2026 11:06 AM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 4/8/2026 2:05 AM, Tim Chen wrote:
>>> And regarding your other question about the calculation of arch_sbm_shift,
>>> I'm trying to understand why there is a subtraction of 1, should it be:
>>> -       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
>>> +       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1];
>>
>> Perhaps something like
>>
>> 	arch_sbm_shift = min(sizeof(unsigned long),
>> 			     topology_get_domain_shift(TOPO_TILE_DOMAIN));
>>
>> to take care of both AMD system and the 64 bit leaf bitmask limit?
> 
> Ack! But do we want to separate CPUs on same LLC domain across
> different cachelines in 64 CPU chunks or should we use the rest
> of the padding to represent them?
> 

I just saw your email and I had the same question.

> I'm collecting some performance numbers to see if makes any
> difference under high contention but have you seen benefits of
> sharding the mask further when there are hundreds of CPU on the
> same LLC?
> 

We haven't tried breaking it down further. One possible approach
is to partition it at L2 scope, the benefit of which may depend on
the workload.


thanks,
Chenyu

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-08 11:35                           ` Chen, Yu C
@ 2026-04-08 15:52                             ` K Prateek Nayak
  2026-04-09  5:17                               ` K Prateek Nayak
  0 siblings, 1 reply; 41+ messages in thread
From: K Prateek Nayak @ 2026-04-08 15:52 UTC (permalink / raw)
  To: Chen, Yu C, Tim Chen, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Chenyu,

On 4/8/2026 5:05 PM, Chen, Yu C wrote:
> We haven't tried breaking it down further. One possible approach
> is to partition it at L2 scope, the benefit of which may depend on
> the workload.

I fear at that point we'll have too many cachelines and too much
cache pollution when the CPU starts reading this at tick to schedule
a newidle balance.

A 128-core system would bring in 128 * 64B = 8kB worth of data to
traverse the mask, and at that point it becomes a trade-off between
how fast you want reads vs writes, and whether it even speeds up writes
after a certain point.

Sorry I got distracted by some other stuff today but I'll share the
results from my experiments tomorrow.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-08  9:25                         ` Chen, Yu C
@ 2026-04-08 16:47                           ` Tim Chen
  0 siblings, 0 replies; 41+ messages in thread
From: Tim Chen @ 2026-04-08 16:47 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li, K Prateek Nayak,
	Peter Zijlstra

On Wed, 2026-04-08 at 17:25 +0800, Chen, Yu C wrote:
> On 4/8/2026 4:35 AM, Tim Chen wrote:
> > On Fri, 2026-04-03 at 13:46 +0800, Chen, Yu C wrote:
> > > On 4/2/2026 7:06 PM, K Prateek Nayak wrote:
> > > > Hello Peter,
> > > > 
> > > > On 4/2/2026 4:25 PM, Peter Zijlstra wrote:
> > > > > On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:
> > > > > 
> > > > > > It is still not super clear to me how the logic deals with more than
> > > > > > 128CPUs in a DIE domain because that'll need more than the u64 but
> > > > > > sbm_find_next_bit() simply does:
> > > > > > 
> > > > > >       tmp = leaf->bitmap & mask; /* All are u64 */
> > > > > > 
> > > > > > expecting just the u64 bitmap to represent all the CPUs in the leaf.
> > > > > > 
> > > > > > If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
> > > > > > as 7f (127) which allows a leaf to more than 64 CPUs but we are
> > > > > > using the "u64 bitmap" directly and not:
> > > > > > 
> > > > > >       find_next_bit(bitmap, arch_sbm_mask)
> > > > > > 
> > > > > > Am I missing something here?
> > > > > 
> > > > > Nope. That logic just isn't there, that was left as an exercise to the
> > > > > reader :-)
> > > > 
> > > > Ack! Let me go fiddle with that.
> > > > 
> > > 
> > > Nice catch. I hadn't noticed this since we have fewer than
> > > 64 CPUs per die. Please feel free to send patches to me when
> > > they're available.
> > > 
> > > And regarding your other question about the calculation of arch_sbm_shift,
> > > I'm trying to understand why there is a subtraction of 1, should it be:
> > > -       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
> > > +       arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1];
> > 
> > Perhaps something like
> > 
> > 	arch_sbm_shift = min(sizeof(unsigned long),
> > 			     topology_get_domain_shift(TOPO_TILE_DOMAIN));
> > 
> > to take care of both AMD system and the 64 bit leaf bitmask limit?
> > 
> 
> Yes, this should be doable (Prateek has mentioned using TOPO_TILE_DOMAIN).
> The only drawback I can think of is that if there are more than 64 CPUs
> within a die, it is possible CPUs in different dies (LLCs) be indexed in
> the same leaf and access the same mask, 
> 

First, I think I should have used:
	arch_sbm_shift = min(BITS_PER_LONG,
			     topology_get_domain_shift(TOPO_TILE_DOMAIN));


I am assuming that we should choose TOPO_DIE_DOMAIN for Intel CPUs and
TOPO_TILE_DOMAIN for AMD CPUs. And the assumption is that such domain
choice will span one L3 (I think that's the case). 

Then leaf domains smaller than the domain size will also only span one
L3 by definition. So for the 128-CPU example you gave, both leaves, with
CPUs 0-63 and 64-127, will span the same LLC and we should not have
cache bouncing.

Tim


> which would still lead to cache
> contention. Maybe we should allocate the leaf cpumask according to the
> actual size of a die?
> 
> thanks,
> Chenyu
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-08 15:52                             ` K Prateek Nayak
@ 2026-04-09  5:17                               ` K Prateek Nayak
  2026-04-09 23:09                                 ` Tim Chen
  0 siblings, 1 reply; 41+ messages in thread
From: K Prateek Nayak @ 2026-04-09  5:17 UTC (permalink / raw)
  To: Chen, Yu C, Tim Chen, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Chenyu, Tim,

On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 4/8/2026 5:05 PM, Chen, Yu C wrote:
>> We haven't tried breaking it down further. One possible approach
>> is to partition it at L2 scope, the benefit of which may depend on
>> the workload.
> 
> I fear at that point we'll have too many cachelines and too much
> cache pollution when the CPU starts reading this at tick to schedule
> a newidle balance.
> 
> A 128 core system would bring in 128 * 64B = 8kB worth of data to
> traverse the mask and at that point it becomes a trade off between
> how fast you want reads vs writes and does it even speed up writes
> after a certain point?
> 
> Sorry I got distracted by some other stuff today but I'll share the
> results from my experiments tomorrow.

Here is some data from an experiment I ran on a 3rd Generation EPYC
system (2 sockets x 64C/128T, 8 LLCs per socket):

Experiment: Two threads are pinned per-CPU on all CPUs, yielding to each
other and operating on some cpumask - one sets the current CPU in the
mask and the other clears it. This is just an estimate of the worst-case
scenario where we have to do one modification per sched-switch.

I'm measuring total cycles taken for cpumask operations with following
variants:

                                            %cycles vs global mask operation

global mask                                     : 100.0000%  (var: 3.28%)
per-NUMA mask                                   :  32.9209%  (var: 7.77%)
per-LLC mask                                    :   1.2977%  (var: 4.85%)
per-LLC mask (u8 operation; no LOCK prefix)     :   0.4930%  (var: 0.83%)

The per-NUMA split is 3x faster, per-LLC on this 16-LLC machine is 77x
faster, and since there is enough space in the cacheline we can use a u8
to set and clear the CPU atomically without the LOCK prefix and then do
a >> 3 on the set bit's position to get the CPU index, which is 202x
faster.

If we use the u8 operations, we can only read 8 CPUs per 8-byte load on
a 64-bit system, but with the per-LLC mask we can scan all 16 CPUs of the
LLC with one 8-byte read, and the per-NUMA one requires two 8-byte reads
to scan the 128 CPUs per socket.

I think a per-LLC mask (or, as Tim suggested, 64 CPUs per cacheline) is
a good tradeoff between the speedup and the number of loads required to
piece together the full cpumask. Thoughts?
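
For reference, the u8 variant in my experiment looks roughly like this
(illustrative only; the type and helper names are mine, not kernel code):

    /* one byte per CPU of an LLC, one cacheline per LLC */
    struct llc_mask {
            u8      cpu_byte[64];
    } ____cacheline_aligned;

    static inline void llc_mask_set(struct llc_mask *m, unsigned int llc_cpu)
    {
            WRITE_ONCE(m->cpu_byte[llc_cpu], 1);    /* plain store, no LOCK */
    }

    static inline void llc_mask_clear(struct llc_mask *m, unsigned int llc_cpu)
    {
            WRITE_ONCE(m->cpu_byte[llc_cpu], 0);
    }

    /* scan 8 CPUs per 8-byte load; set bit position >> 3 is the CPU index */
    static inline int llc_mask_first(struct llc_mask *m)
    {
            unsigned int i;

            for (i = 0; i < sizeof(m->cpu_byte); i += 8) {
                    u64 v = READ_ONCE(*(u64 *)&m->cpu_byte[i]);

                    if (v)
                            return i + (__ffs(v) >> 3);
            }
            return -1;
    }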

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2026-04-08 10:16   ` Chen, Yu C
@ 2026-04-09 11:47     ` Deng, Pan
  0 siblings, 0 replies; 41+ messages in thread
From: Deng, Pan @ 2026-04-09 11:47 UTC (permalink / raw)
  To: Chen, Yu C, peterz@infradead.org, Steven Rostedt
  Cc: linux-kernel@vger.kernel.org, Li, Tianyou,
	tim.c.chen@linux.intel.com, mingo@kernel.org

> According to your test results above, this original proposal seems
> simple enough. It provides a general benefit, not only for FFmpeg workloads
> with "unusual" CPU affinity settings, but also for other common workloads
> that do not use CPU affinity or partitioning.

Yes, exactly. FFmpeg and K8s are just example scenarios - the optimization
benefits any workload with RT thread contention. For instance, running
cyclictest on a 2-socket, 384-logical-core system:

"cyclictest -t -i200 -h 32 -m -p 95 -q"

This patch reduces both mean and max latency by at least 40%.

> I still prefer this proposal. Later we can rebase patch 4 on top of sbm
> to see if it brings further improvements. patch 1 and patch 4 could form a
> patch series IMHO.

Thank you for the feedback. I agree that patch 1 and patch 4 work well
together. Regarding the sbm discussion: we've observed promising results
in our sbm experiments, and I believe rebasing patch 4 on top of sbm would
likely show further improvements beyond the per-NUMA implementation. I'll
try this once the sbm implementation stabilizes.

Per Peter's previous request, I'm planning to add comments like this:
    /*
     * Separate mask to a different cacheline to mitigate contention
     * between count (read-write) and mask (read-mostly when storing
     * pointers). This alignment increases root_domain size by ~11KB,
     * but eliminates cache line bouncing between cpupri_set() writers
     * and cpupri_find_fitness() readers under heavy RT workloads.
     *
     * Memory overhead considerations:
     * - Systems with cpuset partitions: each partition's root_domain is
     *   dynamically allocated (kmalloc). The ~11KB overhead per partition
     *   scales with the partition count, acceptable on servers using
     *   partitions.
     * - Systems without partitions: only the static def_root_domain incurs
     *   the overhead, which is manageable for typical use.
     *
     * Additionally, this cacheline alignment ensures cpupri starts at a
     * cacheline boundary, eliminating false sharing with root_domain's
     * preceding fields (rto_mask, rto_loop_next, rto_loop_start).
     */
    cpumask_var_t		mask	____cacheline_aligned_in_smp;

Since this optimization is independent of the sbm work, would it be possible
to review this patch first? That would allow the sbm-related improvements
(patch 4) to build on top of this foundation once they're ready.

Best Regards
Pan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-09  5:17                               ` K Prateek Nayak
@ 2026-04-09 23:09                                 ` Tim Chen
  2026-04-10  5:51                                   ` Chen, Yu C
  0 siblings, 1 reply; 41+ messages in thread
From: Tim Chen @ 2026-04-09 23:09 UTC (permalink / raw)
  To: K Prateek Nayak, Chen, Yu C, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

On Thu, 2026-04-09 at 10:47 +0530, K Prateek Nayak wrote:
> Hello Chenyu, Tim,
> 
> On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
> > Hello Chenyu,
> > 
> > On 4/8/2026 5:05 PM, Chen, Yu C wrote:
> > > We haven't tried breaking it down further. One possible approach
> > > is to partition it at L2 scope, the benefit of which may depend on
> > > the workload.
> > 
> > I fear at that point we'll have too many cachelines and too much
> > cache pollution when the CPU starts reading this at tick to schedule
> > a newidle balance.
> > 
> > A 128 core system would bring in 128 * 64B = 8kB worth of data to
> > traverse the mask and at that point it becomes a trade off between
> > how fast you want reads vs writes and does it even speed up writes
> > after a certain point?
> > 
> > Sorry I got distracted by some other stuff today but I'll share the
> > results from my experiments tomorrow.
> 
> Here is some data from an experiments I ran on a 3rd Generation EPYC
> system (2 socket x 64C/128T (8LLCs per socket)):
> 
> Experiment: Two threads pinned per-CPU on all CPUs yielding to each other
> and are operating on some cpumask - one setting the current CPU on the
> mask and other clearing the current CPU: Just an estimate of worst case
> scenario is we have to do one modification per sched-switch.
> 
> I'm measuring total cycles taken for cpumask operations with following
> variants:
> 
>                                             %cycles vs global mask operation
> 
> global mask                                     : 100.0000%  (var: 3.28%)
> per-NUMA mask                                   :  32.9209%  (var: 7.77%)
> per-LLC mask                                    :   1.2977%  (var: 4.85%)
> per-LLC mask (u8 operation; no LOCK prefix)     :   0.4930%  (var: 0.83%)
> 
> per-NUMA split is 3X faster, per-LLC on this 16LLC machine is 77x faster
> and since there is enough space in the cacheline we can use a u8 to set
> and clear the CPu atomically without LOCK prefix and then do a >> 3 to
> get the CPU index from set bit which is 202x faster.
> 
> If we use the u8 operations, we can only read 8CPUs per 8-byte load on
> 64-bit system but with per-LLC mask, we can scan all 16CPUs on the LLC
> with one 8-byte read and and per-NUMA one requires two 8-byte reads to
> scan the 128CPUs per socket.
> 
> I think per-LLC mask (or, as Tim suggested, 64CPUs per cacheline) is
> a good tradeoff between the speedup vs amount of loads required to
> piece together the full cpumask. Thoughts?

I agree that a per-LLC mask is a good compromise between minimizing loads
and offering good speedups.  I think we should get the LLC APIC ID mask
from the 0x4 leaf (L1, L2, L3) instead of inferring it from the 0x1f leaf
(Tile, Die, etc.) for Intel.  And the cache leaf is, I think, 0x8000_001D
for AMD.  Those are parsed in the cacheinfo code and we can get the
information from there.
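
Something along these lines is what I had in mind - just a sketch,
assuming we reuse x86's existing per-CPU cpu_llc_id (which is populated
from those cache leaves); sbm_leaf_of() is a made-up name:

    /* map a CPU to its sbm leaf using the LLC id from cacheinfo */
    static inline unsigned int sbm_leaf_of(unsigned int cpu)
    {
            return per_cpu(cpu_llc_id, cpu);
    }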

Tim

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-09 23:09                                 ` Tim Chen
@ 2026-04-10  5:51                                   ` Chen, Yu C
  2026-04-10  6:02                                     ` K Prateek Nayak
  0 siblings, 1 reply; 41+ messages in thread
From: Chen, Yu C @ 2026-04-10  5:51 UTC (permalink / raw)
  To: Tim Chen, K Prateek Nayak, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hi Prateek, Tim,

On 4/10/2026 7:09 AM, Tim Chen wrote:
> On Thu, 2026-04-09 at 10:47 +0530, K Prateek Nayak wrote:
>> Hello Chenyu, Tim,
>>
>> On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
>>> Hello Chenyu,
>>>
>>> On 4/8/2026 5:05 PM, Chen, Yu C wrote:
>>>> We haven't tried breaking it down further. One possible approach
>>>> is to partition it at L2 scope, the benefit of which may depend on
>>>> the workload.
>>>
>>> I fear at that point we'll have too many cachelines and too much
>>> cache pollution when the CPU starts reading this at tick to schedule
>>> a newidle balance.
>>>
>>> A 128 core system would bring in 128 * 64B = 8kB worth of data to
>>> traverse the mask and at that point it becomes a trade off between
>>> how fast you want reads vs writes and does it even speed up writes
>>> after a certain point?
>>>
>>> Sorry I got distracted by some other stuff today but I'll share the
>>> results from my experiments tomorrow.
>>
>> Here is some data from an experiments I ran on a 3rd Generation EPYC
>> system (2 socket x 64C/128T (8LLCs per socket)):
>>
>> Experiment: Two threads pinned per-CPU on all CPUs yielding to each other
>> and are operating on some cpumask - one setting the current CPU on the
>> mask and other clearing the current CPU: Just an estimate of worst case
>> scenario is we have to do one modification per sched-switch.
>>
>> I'm measuring total cycles taken for cpumask operations with following
>> variants:
>>
>>                                              %cycles vs global mask operation
>>
>> global mask                                     : 100.0000%  (var: 3.28%)
>> per-NUMA mask                                   :  32.9209%  (var: 7.77%)
>> per-LLC mask                                    :   1.2977%  (var: 4.85%)
>> per-LLC mask (u8 operation; no LOCK prefix)     :   0.4930%  (var: 0.83%)
>>
>> per-NUMA split is 3X faster, per-LLC on this 16LLC machine is 77x faster
>> and since there is enough space in the cacheline we can use a u8 to set
>> and clear the CPu atomically without LOCK prefix and then do a >> 3 to
>> get the CPU index from set bit which is 202x faster.
>>
>> If we use the u8 operations, we can only read 8CPUs per 8-byte load on
>> 64-bit system but with per-LLC mask, we can scan all 16CPUs on the LLC
>> with one 8-byte read and and per-NUMA one requires two 8-byte reads to
>> scan the 128CPUs per socket.
>>
>> I think per-LLC mask (or, as Tim suggested, 64CPUs per cacheline) is
>> a good tradeoff between the speedup vs amount of loads required to
>> piece together the full cpumask. Thoughts?

Yes, making it per-LLC should work well enough (for balancing) to
achieve the optimal benefit. Let me run some tests similar to yours, plus
hackbench/schbench, to see what the results are.
BTW, on AMD systems, does the TILE domain always match the CCX where
the L3 is shared? On Intel the DIE is not always mapped to a domain
where the L3 is shared.

> 
> I agree that per-LLC mask is a good compromise between minimizing loads
> and offer good speed ups.  I think we should get the LLC APICID
> mask from 0x4 leaf (L1, L2, L3) instead of inferring from 0x1f leaf (Tile, Die ...etc)
> for Intel.  And the cache leaf I think is 0x8000_001D leaf for AMD.
> Those are parsed in cacheinfo code and we can get it from there.
> 

Yes, let me check how we can leverage the L3 id for that.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-10  5:51                                   ` Chen, Yu C
@ 2026-04-10  6:02                                     ` K Prateek Nayak
  0 siblings, 0 replies; 41+ messages in thread
From: K Prateek Nayak @ 2026-04-10  6:02 UTC (permalink / raw)
  To: Chen, Yu C, Tim Chen, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Chenyu, Tim,

On 4/10/2026 11:21 AM, Chen, Yu C wrote:
>>> I think per-LLC mask (or, as Tim suggested, 64CPUs per cacheline) is
>>> a good tradeoff between the speedup vs amount of loads required to
>>> piece together the full cpumask. Thoughts?
> 
> Yes, making it per LLC should work well enough (for balancing) to
> achieve optimal benefit. Let me run some similar tests to yours,plus
> hackbench/schbench, to see what the results are.
> BTW, on AMD systems, does the TILE domain always match the CCX where
> L3 is shared? On Intel the DIE is not always mapped to a domain
> where L3 is shared.

On AMD platforms that support the extended leaf 0x80000026, the CCX is
always mapped to the L3 and matches the data in the 0x8000001D cache
property leaf for L3.

> 
>>
>> I agree that per-LLC mask is a good compromise between minimizing loads
>> and offer good speed ups.  I think we should get the LLC APICID
>> mask from 0x4 leaf (L1, L2, L3) instead of inferring from 0x1f leaf (Tile, Die ...etc)
>> for Intel.  And the cache leaf I think is 0x8000_001D leaf for AMD.
>> Those are parsed in cacheinfo code and we can get it from there.
>>
> 
> Yes, let me check how we can leverage the l3 id for that.

Ack! I think the cacheinfo is better for all this and is also compatible
with older systems that may not have the extended topology enumeration
leaf. AMD only got it two generations ago, and until then only the cache
property leaf was used for marking the LLC (CCX) boundary.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2026-04-10  6:02 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
2025-07-21  6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
2026-03-20 10:09   ` Peter Zijlstra
2026-03-24  9:36     ` Deng, Pan
2026-03-24 12:11       ` Peter Zijlstra
2026-03-27 10:17         ` Deng, Pan
2026-04-02 10:37           ` Deng, Pan
2026-04-02 10:43           ` Peter Zijlstra
2026-04-08 10:16   ` Chen, Yu C
2026-04-09 11:47     ` Deng, Pan
2025-07-21  6:10 ` [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
2026-03-20 10:18   ` Peter Zijlstra
2025-07-21  6:10 ` [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2026-03-20 10:24   ` Peter Zijlstra
2026-03-23 18:09     ` Tim Chen
2026-03-24 12:16       ` Peter Zijlstra
2026-03-24 22:40         ` Tim Chen
2025-07-21  6:10 ` [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
2026-03-20 12:40   ` Peter Zijlstra
2026-03-23 18:45     ` Tim Chen
2026-03-24 12:00       ` Peter Zijlstra
2026-03-31  5:37         ` Chen, Yu C
2026-03-31 10:19           ` K Prateek Nayak
2026-04-02  3:15             ` Chen, Yu C
2026-04-02  4:41               ` K Prateek Nayak
2026-04-02 10:55                 ` Peter Zijlstra
2026-04-02 11:06                   ` K Prateek Nayak
2026-04-03  5:46                     ` Chen, Yu C
2026-04-03  8:13                       ` K Prateek Nayak
2026-04-07 20:35                       ` Tim Chen
2026-04-08  3:06                         ` K Prateek Nayak
2026-04-08 11:35                           ` Chen, Yu C
2026-04-08 15:52                             ` K Prateek Nayak
2026-04-09  5:17                               ` K Prateek Nayak
2026-04-09 23:09                                 ` Tim Chen
2026-04-10  5:51                                   ` Chen, Yu C
2026-04-10  6:02                                     ` K Prateek Nayak
2026-04-08  9:25                         ` Chen, Yu C
2026-04-08 16:47                           ` Tim Chen
2026-03-20  9:59 ` [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Peter Zijlstra
2026-03-20 12:50   ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox