* [PATCH 0/4] sched/rt: mitigate root_domain cache line contention
@ 2025-07-07 2:35 Pan Deng
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
From: Deng Pan <pan.deng@intel.com>
When running a multi-instance FFmpeg workload in a cloud environment,
cache line contention during access to root_domain data structures is
severe and significantly degrades performance.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
Profiling shows the kernel consumes ~20% of CPU cycles, which is
excessive in this scenario. The overhead primarily comes from RT task
scheduling functions like `cpupri_set`, `cpupri_find_fitness`,
`dequeue_pushable_task`, `enqueue_pushable_task`, `pull_rt_task`,
`__find_first_and_bit`, and `__bitmap_and`. This is due to read/write
contention on root_domain cache lines.
The `perf c2c` report, sorted by contention severity, reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` is heavily loaded/stored,
since counts[0] is updated more frequently than the others: it changes
whenever an RT task enqueues onto an empty runqueue or dequeues from a
non-overloaded runqueue.
- `rto_mask` is heavily loaded
- `rto_loop_next` and `rto_loop_start` are frequently stored
- `rto_push_work` and `rto_lock` are lightly accessed
- cycles per load: ~10K to 59K.
root_domain cache line 1:
- `rto_count` is frequently loaded/stored
- `overloaded` is heavily loaded
- cycles per load: ~2.8K to 44K
cpumask (bitmap) cache line of cpupri_vec->mask:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K
The last cache line of cpupri:
- `cpupri_vec->count` and `mask` contend with each other. The transcoding
threads use RT priority 99, so the contention falls on the end of the
structure.
- cycles per load: ~1.5K to 10.5K
Based on the above, we propose 4 patches to mitigate the contention.
Patch 1: Reorganize `cpupri_vec`, separate `count`, `mask` fields,
reducing contention on root_domain cache line 3 and cpupri's
last cache line.
Patch 2: Restructure `root_domain` to minimize contention on root_domain
cache lines 1 and 3 by reordering fields.
Patch 3: Split `root_domain->rto_count` to per-NUMA-node counters,
reducing the contention on root_domain cache line 1.
Patch 4: Split `cpupri_vec->cpumask` to per-NUMA-node bitmaps, reducing
load/store contention on the cpumask bitmap cache line.
Evaluation:
Performance improvements (FPS, relative to baseline):
- Patch 1: +11.0%
- Patch 2: +5.0%
- Patch 3: +4.0%
- Patch 4: +3.8%
Kernel CPU cycle usage reduction:
- Patch 1: 20.0% -> 11.0%
- Patch 2: 20.0% -> 17.7%
- Patch 3: 20.0% -> 18.6%
- Patch 4: 20.0% -> 18.7%
Cycles per load reduction (by perf c2c report):
- Patch 1:
- `root_domain` cache line 3: 10K–59K -> 0.5K–8K
- `cpupri` last cache line: 1.5K–10.5K -> eliminated
- Patch 2:
- `root_domain` cache line 1: 2.8K–44K -> 2.1K–2.7K
- `root_domain` cache line 3: 10K–59K -> eliminated
- Patch 3:
- `root_domain` cache line 1: 2.8K–44K -> eliminated
- Patch 4:
- `cpupri_vec->mask` cache line: 2.2K–8.7K -> 0.5K–2.2K
Comments are appreciated.
Pan Deng (4):
sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
sched/rt: Restructure root_domain to reduce cacheline contention
sched/rt: Split root_domain->rto_count to per-NUMA-node counters
sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce
contention
kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++----
kernel/sched/cpupri.h | 6 +-
kernel/sched/rt.c | 65 ++++++++++++-
kernel/sched/sched.h | 61 ++++++------
kernel/sched/topology.c | 7 ++
5 files changed, 291 insertions(+), 48 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
@ 2025-07-07 2:35 ` Pan Deng
2025-09-01 5:10 ` Chen, Yu C
2025-07-07 2:35 ` [PATCH 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC (high core count)
system, significant
cache line contention is observed around `cpupri_vec->count` and `mask` in
struct root_domain.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
perf c2c tool reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
and contends with other fields, since counts[0] is updated more
frequently than the others: it changes whenever an RT task enqueues
onto an empty runqueue or dequeues from a non-overloaded runqueue.
- cycles per load: ~10K to 59K
cpupri's last cache line:
- `cpupri_vec->count` and `mask` contend with each other. The transcoding
threads use RT priority 99, so the contention falls on the last cache
line of the structure.
- cycles per load: ~1.5K to 10.5K
This change mitigates `cpupri_vec->count`, `mask` related contentions by
separating each count and mask into different cache lines.
As a result:
- FPS improves by ~11%
- Kernel cycles% drops from ~20% to ~11%
- `count` and `mask` related cache line contention is mitigated, perf c2c
shows root_domain cache line 3 `cycles per load` drops from ~10K-59K
to ~0.5K-8K, cpupri's last cache line no longer appears in the report.
Note: The side effect of this change is that struct cpupri size is
increased from 26 cache lines to 203 cache lines.
An alternative approach could be separating `counts` and `masks` into 2
vectors in cpupri_vec (counts[] and masks[]), and add two paddings:
1. Between counts[0] and counts[1], since counts[0] is more frequently
updated than others.
2. Between the two vectors, since counts[] is read-write access while
masks[] is read access when it stores pointers.
The alternative approach introduces the complexity of a 31+/21- LoC
change. It achieves almost the same performance, while struct cpupri
size is reduced from 26 cache lines to 21 cache lines.
Appendix:
1. Current layout of contended data structures:
struct root_domain {
atomic_t refcount; /* 0 4 */
atomic_t rto_count; /* 4 4 */
struct callback_head rcu __attribute__((__aligned__(8)));/*8 16 */
cpumask_var_t span; /* 24 8 */
cpumask_var_t online; /* 32 8 */
bool overloaded; /* 40 1 */
bool overutilized; /* 41 1 */
/* XXX 6 bytes hole, try to pack */
cpumask_var_t dlo_mask; /* 48 8 */
atomic_t dlo_count; /* 56 4 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct dl_bw dl_bw; /* 64 24 */
struct cpudl cpudl; /* 88 24 */
u64 visit_gen; /* 112 8 */
struct irq_work rto_push_work; /* 120 32 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
raw_spinlock_t rto_lock; /* 152 4 */
int rto_loop; /* 156 4 */
int rto_cpu; /* 160 4 */
atomic_t rto_loop_next; /* 164 4 */
atomic_t rto_loop_start; /* 168 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t rto_mask; /* 176 8 */
struct cpupri cpupri; /* 184 1624 */
/* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
struct perf_domain * pd; /* 1808 8 */
/* size: 1816, cachelines: 29, members: 21 */
/* sum members: 1802, holes: 3, sum holes: 14 */
/* forced alignments: 1 */
/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));
struct cpupri {
struct cpupri_vec pri_to_cpu[101]; /* 0 1616 */
/* --- cacheline 25 boundary (1600 bytes) was 16 bytes ago --- */
int * cpu_to_pri; /* 1616 8 */
/* size: 1624, cachelines: 26, members: 2 */
/* last cacheline: 24 bytes */
};
struct cpupri_vec {
atomic_t count; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t mask; /* 8 8 */
/* size: 16, cachelines: 1, members: 2 */
/* sum members: 12, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};
2. Perf c2c report of root_domain cache line 3:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
353 44 62 0xff14d42c400e3880
------- ------- ------ ------ ------ ------ ------------------------
0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_
0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_
0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on
0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single
0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on
0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl
0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl
0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl
0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock
1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task
0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task
0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task
0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task
0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task
18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task
17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task
1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task
0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task
34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness
13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set
3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set
1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness
1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set
1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set
3. Perf c2c report of cpupri's last cache line
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
149 43 41 0xff14d42c400e3ec0
------- ------- ------ ------ ------ ------ ------------------------
8.72% 11.63% 0.00% 0x8 2001 165 cpupri_find_fitness
1.34% 2.33% 0.00% 0x18 1456 151 cpupri_find_fitness
8.72% 9.30% 58.54% 0x28 1744 263 cpupri_set
2.01% 4.65% 41.46% 0x28 1958 301 cpupri_set
1.34% 0.00% 0.00% 0x28 10580 6 cpupri_set
69.80% 67.44% 0.00% 0x30 1754 347 cpupri_set
8.05% 4.65% 0.00% 0x30 2144 256 cpupri_set
Signed-off-by: Pan Deng <pan.deng@intel.com>
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/cpupri.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
struct cpupri_vec {
atomic_t count;
- cpumask_var_t mask;
+ cpumask_var_t mask ____cacheline_aligned;
};
struct cpupri {
--
2.43.5
* [PATCH 2/4] sched/rt: Restructure root_domain to reduce cacheline contention
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
@ 2025-07-07 2:35 ` Pan Deng
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2025-07-07 2:35 ` [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
3 siblings, 0 replies; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed in root_domain cache lines 1 and 3.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
perf c2c tool reveals (sorted by contention severity):
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored,
since counts[0] is updated more frequently than the others: it changes
whenever an RT task enqueues onto an empty runqueue or dequeues from a
non-overloaded runqueue.
- `rto_mask` (0x30) is heavily loaded
- `rto_loop_next` (0x24) and `rto_loop_start` (0x28) are frequently stored
- `rto_push_work` (0x0) and `rto_lock` (0x18) are lightly accessed
- cycles per load: ~10K to 59K
root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K
This change adjusts the layout of `root_domain` to isolate these contended
fields across separate cache lines:
1. `rto_count` remains in the 1st cache line; `overloaded` and
`overutilized` are moved to the last cache line
2. `rto_push_work` is placed in the 2nd cache line
3. `rto_loop_start`, `rto_loop_next`, and `rto_lock` remain in the 3rd
cache line; `rto_mask` is moved near `pd` in the penultimate cache line
4. `cpupri` starts at the 4th cache line to prevent `pri_to_cpu[0].count`
contending with fields in cache line 3.
With this change:
- FPS improves by ~5%
- Kernel cycles% drops from ~20% to ~17.7%
- root_domain cache line 3 no longer appears in perf-c2c report
- cycles per load of root_domain cache line 1 is reduced from
~2.8K-44K to ~2.1K-2.7K
Given the nature of the change, to my understanding, it doesn't
introduce any negative impact in other scenarios.
Note: This change increases the size of `root_domain` from 29 to 31
cache lines. This is considered acceptable since `root_domain` is a
single global object.
Appendix:
1. Current layout of contended data structures:
struct root_domain {
atomic_t refcount; /* 0 4 */
atomic_t rto_count; /* 4 4 */
struct callback_head rcu __attribute__((__aligned__(8)));/*8 16 */
cpumask_var_t span; /* 24 8 */
cpumask_var_t online; /* 32 8 */
bool overloaded; /* 40 1 */
bool overutilized; /* 41 1 */
/* XXX 6 bytes hole, try to pack */
cpumask_var_t dlo_mask; /* 48 8 */
atomic_t dlo_count; /* 56 4 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct dl_bw dl_bw; /* 64 24 */
struct cpudl cpudl; /* 88 24 */
u64 visit_gen; /* 112 8 */
struct irq_work rto_push_work; /* 120 32 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
raw_spinlock_t rto_lock; /* 152 4 */
int rto_loop; /* 156 4 */
int rto_cpu; /* 160 4 */
atomic_t rto_loop_next; /* 164 4 */
atomic_t rto_loop_start; /* 168 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t rto_mask; /* 176 8 */
struct cpupri cpupri; /* 184 1624 */
/* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
struct perf_domain * pd; /* 1808 8 */
/* size: 1816, cachelines: 29, members: 21 */
/* sum members: 1802, holes: 3, sum holes: 14 */
/* forced alignments: 1 */
/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));
struct cpupri {
struct cpupri_vec pri_to_cpu[101]; /* 0 1616 */
/* --- cacheline 25 boundary (1600 bytes) was 16 bytes ago --- */
int * cpu_to_pri; /* 1616 8 */
/* size: 1624, cachelines: 26, members: 2 */
/* last cacheline: 24 bytes */
};
struct cpupri_vec {
atomic_t count; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t mask; /* 8 8 */
/* size: 16, cachelines: 1, members: 2 */
/* sum members: 12, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};
2. Perf c2c report of root_domain cache line 3:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
353 44 62 0xff14d42c400e3880
------- ------- ------ ------ ------ ------ ------------------------
0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_
0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_
0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on
0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single
0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on
0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl
0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl
0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl
0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock
1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task
0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task
0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task
0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task
0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task
18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task
17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task
1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task
0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task
34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness
13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set
3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set
1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness
1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set
1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set
3. Perf c2c report of root_domain cache line 1:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
231 43 48 0xff14d42c400e3800
------- ------- ------ ------ ------ ------ ------------------------
22.51% 18.60% 0.00% 0x4 5041 247 pull_rt_task
5.63% 2.33% 45.83% 0x4 6995 315 dequeue_pushable_task
3.90% 4.65% 54.17% 0x4 6587 370 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 17111 4 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 44062 4 dequeue_pushable_task
32.03% 27.91% 0.00% 0x28 6393 285 enqueue_task_rt
16.45% 27.91% 0.00% 0x28 5534 139 sched_balance_newidle
14.72% 18.60% 0.00% 0x28 5287 110 dequeue_task_rt
3.46% 0.00% 0.00% 0x28 2820 25 enqueue_task_fair
0.43% 0.00% 0.00% 0x28 220 3 enqueue_task_stop
Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/sched.h | 52 +++++++++++++++++++++++---------------------
1 file changed, 27 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..dd3c79470bfc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -968,24 +968,29 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;
+ atomic_t dlo_count;
+ struct dl_bw dl_bw;
+ struct cpudl cpudl;
+
+#ifdef HAVE_RT_PUSH_IPI
/*
- * Indicate pullable load on at least one CPU, e.g:
- * - More than one runnable task
- * - Running task is misfit
+ * For IPI pull requests, loop across the rto_mask.
*/
- bool overloaded;
-
- /* Indicate one or more CPUs over-utilized (tipping point) */
- bool overutilized;
+ struct irq_work rto_push_work;
+ raw_spinlock_t rto_lock;
+ /* These are only updated and read within rto_lock */
+ int rto_loop;
+ int rto_cpu;
+ /* These atomics are updated outside of a lock */
+ atomic_t rto_loop_next;
+ atomic_t rto_loop_start;
+#endif
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
*/
cpumask_var_t dlo_mask;
- atomic_t dlo_count;
- struct dl_bw dl_bw;
- struct cpudl cpudl;
/*
* Indicate whether a root_domain's dl_bw has been checked or
@@ -995,32 +1000,29 @@ struct root_domain {
* that u64 is 'big enough'. So that shouldn't be a concern.
*/
u64 visit_cookie;
+ struct cpupri cpupri ____cacheline_aligned;
-#ifdef HAVE_RT_PUSH_IPI
/*
- * For IPI pull requests, loop across the rto_mask.
+ * NULL-terminated list of performance domains intersecting with the
+ * CPUs of the rd. Protected by RCU.
*/
- struct irq_work rto_push_work;
- raw_spinlock_t rto_lock;
- /* These are only updated and read within rto_lock */
- int rto_loop;
- int rto_cpu;
- /* These atomics are updated outside of a lock */
- atomic_t rto_loop_next;
- atomic_t rto_loop_start;
-#endif
+ struct perf_domain __rcu *pd ____cacheline_aligned;
+
/*
* The "RT overload" flag: it gets set if a CPU has more than
* one runnable RT task.
*/
cpumask_var_t rto_mask;
- struct cpupri cpupri;
/*
- * NULL-terminated list of performance domains intersecting with the
- * CPUs of the rd. Protected by RCU.
+ * Indicate pullable load on at least one CPU, e.g:
+ * - More than one runnable task
+ * - Running task is misfit
*/
- struct perf_domain __rcu *pd;
+ bool overloaded ____cacheline_aligned;
+
+ /* Indicate one or more CPUs over-utilized (tipping point) */
+ bool overutilized;
};
extern void init_defrootdomain(void);
--
2.43.5
* [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
2025-07-07 2:35 ` [PATCH 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
@ 2025-07-07 2:35 ` Pan Deng
2025-07-07 6:53 ` kernel test robot
` (2 more replies)
2025-07-07 2:35 ` [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
3 siblings, 3 replies; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on the root_domain `rto_count` and
`overloaded` fields.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
perf c2c tool reveals:
root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K
A separate patch rearranges root_domain to place `overloaded` on a
different cache line, but this alone is insufficient to resolve the
contention on `rto_count`. As a complement, this patch splits
`rto_count` into per-NUMA-node counters to reduce the contention.
With this change:
- FPS improves by ~4%
- Kernel cycles% drops from ~20% to ~18.6%
- The cache line no longer appears in perf-c2c report
Appendix:
1. Perf c2c report of root_domain cache line 1:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
231 43 48 0xff14d42c400e3800
------- ------- ------ ------ ------ ------ ------------------------
22.51% 18.60% 0.00% 0x4 5041 247 pull_rt_task
5.63% 2.33% 45.83% 0x4 6995 315 dequeue_pushable_task
3.90% 4.65% 54.17% 0x4 6587 370 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 17111 4 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 44062 4 dequeue_pushable_task
32.03% 27.91% 0.00% 0x28 6393 285 enqueue_task_rt
16.45% 27.91% 0.00% 0x28 5534 139 sched_balance_newidle
14.72% 18.60% 0.00% 0x28 5287 110 dequeue_task_rt
3.46% 0.00% 0.00% 0x28 2820 25 enqueue_task_fair
0.43% 0.00% 0.00% 0x28 220 3 enqueue_task_stop
Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/rt.c | 65 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 9 +++++-
kernel/sched/topology.c | 7 +++++
3 files changed, 77 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c37033..cc820dbde6d6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -337,9 +337,58 @@ static inline bool need_pull_rt_task(struct rq *rq, struct task_struct *prev)
return rq->online && rq->rt.highest_prio.curr > prev->prio;
}
+int rto_counts_init(atomic_tp **rto_counts)
+{
+ int i;
+ atomic_tp *counts = kzalloc(nr_node_ids * sizeof(atomic_tp), GFP_KERNEL);
+
+ if (!counts)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_node_ids; i++) {
+ counts[i] = kzalloc_node(sizeof(atomic_t), GFP_KERNEL, i);
+
+ if (!counts[i])
+ goto cleanup;
+ }
+
+ *rto_counts = counts;
+ return 0;
+
+cleanup:
+ while (i--)
+ kfree(counts[i]);
+
+ kfree(counts);
+ return -ENOMEM;
+}
+
+void rto_counts_cleanup(atomic_tp *rto_counts)
+{
+ for (int i = 0; i < nr_node_ids; i++)
+ kfree(rto_counts[i]);
+
+ kfree(rto_counts);
+}
+
static inline int rt_overloaded(struct rq *rq)
{
- return atomic_read(&rq->rd->rto_count);
+ int count = 0;
+ int cur_node, nid;
+
+ cur_node = numa_node_id();
+
+ for (int i = 0; i < nr_node_ids; i++) {
+ nid = (cur_node + i) % nr_node_ids;
+ count += atomic_read(rq->rd->rto_counts[nid]);
+
+ // The caller only checks whether the count is
+ // 0 or 1, so return as soon as it exceeds 1
+ if (count > 1)
+ return count;
+ }
+
+ return count;
}
static inline void rt_set_overload(struct rq *rq)
@@ -358,7 +407,7 @@ static inline void rt_set_overload(struct rq *rq)
* Matched by the barrier in pull_rt_task().
*/
smp_wmb();
- atomic_inc(&rq->rd->rto_count);
+ atomic_inc(rq->rd->rto_counts[cpu_to_node(rq->cpu)]);
}
static inline void rt_clear_overload(struct rq *rq)
@@ -367,7 +416,7 @@ static inline void rt_clear_overload(struct rq *rq)
return;
/* the order here really doesn't matter */
- atomic_dec(&rq->rd->rto_count);
+ atomic_dec(rq->rd->rto_counts[cpu_to_node(rq->cpu)]);
cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
}
@@ -443,6 +492,16 @@ static inline void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
static inline void rt_queue_push_tasks(struct rq *rq)
{
}
+
+int rto_counts_init(atomic_tp **rto_counts)
+{
+ return 0;
+}
+
+void rto_counts_cleanup(atomic_tp *rto_counts)
+{
+}
+
#endif /* CONFIG_SMP */
static void enqueue_top_rt_rq(struct rt_rq *rt_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dd3c79470bfc..f80968724dd6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -953,6 +953,8 @@ struct perf_domain {
struct rcu_head rcu;
};
+typedef atomic_t *atomic_tp;
+
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
@@ -963,12 +965,15 @@ struct perf_domain {
*/
struct root_domain {
atomic_t refcount;
- atomic_t rto_count;
struct rcu_head rcu;
cpumask_var_t span;
cpumask_var_t online;
atomic_t dlo_count;
+
+ /* rto_count per node */
+ atomic_tp *rto_counts;
+
struct dl_bw dl_bw;
struct cpudl cpudl;
@@ -1030,6 +1035,8 @@ extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
extern void sched_get_rd(struct root_domain *rd);
extern void sched_put_rd(struct root_domain *rd);
+extern int rto_counts_init(atomic_tp **rto_counts);
+extern void rto_counts_cleanup(atomic_tp *rto_counts);
static inline int get_rd_overloaded(struct root_domain *rd)
{
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b958fe48e020..166dc8177a44 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -457,6 +457,7 @@ static void free_rootdomain(struct rcu_head *rcu)
{
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
+ rto_counts_cleanup(rd->rto_counts);
cpupri_cleanup(&rd->cpupri);
cpudl_cleanup(&rd->cpudl);
free_cpumask_var(rd->dlo_mask);
@@ -549,8 +550,14 @@ static int init_rootdomain(struct root_domain *rd)
if (cpupri_init(&rd->cpupri) != 0)
goto free_cpudl;
+
+ if (rto_counts_init(&rd->rto_counts) != 0)
+ goto free_cpupri;
+
return 0;
+free_cpupri:
+ cpupri_cleanup(&rd->cpupri);
free_cpudl:
cpudl_cleanup(&rd->cpudl);
free_rto_mask:
--
2.43.5
* [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
` (2 preceding siblings ...)
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
@ 2025-07-07 2:35 ` Pan Deng
2025-07-21 11:23 ` Chen, Yu C
3 siblings, 1 reply; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on the bitmap of `cpupri_vec->cpumask`.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
perf c2c tool reveals:
cpumask (bitmap) cache line of `cpupri_vec->mask`:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K
This change splits `cpupri_vec->cpumask` into per-NUMA-node masks to
mitigate the cache line contention.
As a result:
- FPS improves by ~3.8%
- Kernel cycles% drops from ~20% to ~18.7%
- Cache line contention is mitigated, perf-c2c shows cycles per load
drops from ~2.2K-8.7K to ~0.5K-2.2K
Note: the CONFIG_CPUMASK_OFFSTACK=n case remains unchanged.
Appendix:
1. Perf c2c report of `cpupri_vec->mask` bitmap cache line:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
155 39 39 0xff14d52c4682d800
------- ------- ------ ------ ------ ------ ------------------------
43.23% 43.59% 0.00% 0x0 3489 415 _find_first_and_bit
3.23% 5.13% 0.00% 0x0 3478 107 __bitmap_and
3.23% 0.00% 0.00% 0x0 2712 33 _find_first_and_bit
1.94% 0.00% 7.69% 0x0 5992 33 cpupri_set
0.00% 0.00% 5.13% 0x0 3733 19 cpupri_set
12.90% 12.82% 0.00% 0x8 3452 297 _find_first_and_bit
1.29% 2.56% 0.00% 0x8 3007 117 __bitmap_and
0.00% 5.13% 0.00% 0x8 3041 20 _find_first_and_bit
0.00% 2.56% 2.56% 0x8 2374 22 cpupri_set
0.00% 0.00% 7.69% 0x8 4194 38 cpupri_set
8.39% 2.56% 0.00% 0x10 3336 264 _find_first_and_bit
3.23% 0.00% 0.00% 0x10 3023 46 _find_first_and_bit
2.58% 0.00% 0.00% 0x10 3040 130 __bitmap_and
1.29% 0.00% 12.82% 0x10 4075 34 cpupri_set
0.00% 0.00% 2.56% 0x10 2197 19 cpupri_set
0.00% 2.56% 7.69% 0x18 4085 27 cpupri_set
0.00% 2.56% 0.00% 0x18 3128 220 _find_first_and_bit
0.00% 0.00% 5.13% 0x18 3028 20 cpupri_set
2.58% 2.56% 0.00% 0x20 3089 198 _find_first_and_bit
1.29% 0.00% 5.13% 0x20 5114 29 cpupri_set
0.65% 2.56% 0.00% 0x20 3224 96 __bitmap_and
0.65% 0.00% 7.69% 0x20 4392 31 cpupri_set
2.58% 0.00% 0.00% 0x28 3327 214 _find_first_and_bit
0.65% 2.56% 5.13% 0x28 5252 31 cpupri_set
0.65% 0.00% 7.69% 0x28 8755 25 cpupri_set
0.65% 0.00% 0.00% 0x28 4414 14 _find_first_and_bit
1.29% 2.56% 0.00% 0x30 3139 171 _find_first_and_bit
0.65% 0.00% 7.69% 0x30 2185 18 cpupri_set
0.65% 0.00% 0.00% 0x30 3404 108 __bitmap_and
0.00% 0.00% 2.56% 0x30 5542 21 cpupri_set
3.23% 5.13% 0.00% 0x38 3493 190 _find_first_and_bit
3.23% 2.56% 0.00% 0x38 3171 108 __bitmap_and
0.00% 2.56% 7.69% 0x38 3285 14 cpupri_set
0.00% 0.00% 5.13% 0x38 4035 27 cpupri_set
Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++++----
kernel/sched/cpupri.h | 4 +
2 files changed, 186 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 42c40cfdf836..306b6baff4cd 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -64,6 +64,143 @@ static int convert_prio(int prio)
return cpupri;
}
+#ifdef CONFIG_CPUMASK_OFFSTACK
+static inline int alloc_vec_masks(struct cpupri_vec *vec)
+{
+ int i;
+
+ for (i = 0; i < nr_node_ids; i++) {
+ if (!zalloc_cpumask_var_node(&vec->masks[i], GFP_KERNEL, i))
+ goto cleanup;
+
+	/* Clear the bits of the current node, set all others */
+ bitmap_complement(cpumask_bits(vec->masks[i]),
+ cpumask_bits(cpumask_of_node(i)), small_cpumask_bits);
+ }
+ return 0;
+
+cleanup:
+ while (i--)
+ free_cpumask_var(vec->masks[i]);
+ return -ENOMEM;
+}
+
+static inline void free_vec_masks(struct cpupri_vec *vec)
+{
+ for (int i = 0; i < nr_node_ids; i++)
+ free_cpumask_var(vec->masks[i]);
+}
+
+static inline int setup_vec_mask_var_ts(struct cpupri *cp)
+{
+ int i;
+
+ for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+ vec->masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
+ if (!vec->masks)
+ goto cleanup;
+ }
+ return 0;
+
+cleanup:
+ /* Free any already allocated masks */
+ while (i--) {
+ kfree(cp->pri_to_cpu[i].masks);
+ cp->pri_to_cpu[i].masks = NULL;
+ }
+
+ return -ENOMEM;
+}
+
+static inline void free_vec_mask_var_ts(struct cpupri *cp)
+{
+ for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ kfree(cp->pri_to_cpu[i].masks);
+ cp->pri_to_cpu[i].masks = NULL;
+ }
+}
+
+static inline int
+available_cpu_in_nodes(struct task_struct *p, struct cpupri_vec *vec)
+{
+ int cur_node = numa_node_id();
+
+ for (int i = 0; i < nr_node_ids; i++) {
+ int nid = (cur_node + i) % nr_node_ids;
+
+ if (cpumask_first_and_and(&p->cpus_mask, vec->masks[nid],
+ cpumask_of_node(nid)) < nr_cpu_ids)
+ return 1;
+ }
+
+ return 0;
+}
+
+#define available_cpu_in_vec available_cpu_in_nodes
+
+#else /* !CONFIG_CPUMASK_OFFSTACK */
+
+static inline int alloc_vec_masks(struct cpupri_vec *vec)
+{
+ if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ return 0;
+}
+
+static inline void free_vec_masks(struct cpupri_vec *vec)
+{
+ free_cpumask_var(vec->mask);
+}
+
+static inline int setup_vec_mask_var_ts(struct cpupri *cp)
+{
+ return 0;
+}
+
+static inline void free_vec_mask_var_ts(struct cpupri *cp)
+{
+}
+
+static inline int
+available_cpu_in_vec(struct task_struct *p, struct cpupri_vec *vec)
+{
+ if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+ return 0;
+
+ return 1;
+}
+#endif
+
+static inline int alloc_all_masks(struct cpupri *cp)
+{
+ int i;
+
+ for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ if (alloc_vec_masks(&cp->pri_to_cpu[i]))
+ goto cleanup;
+ }
+
+ return 0;
+
+cleanup:
+ while (i--)
+ free_vec_masks(&cp->pri_to_cpu[i]);
+
+ return -ENOMEM;
+}
+
+static inline void setup_vec_counts(struct cpupri *cp)
+{
+ for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+ atomic_set(&vec->count, 0);
+ }
+}
+
static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
struct cpumask *lowest_mask, int idx)
{
@@ -96,11 +233,24 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
if (skip)
return 0;
- if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+ if (!available_cpu_in_vec(p, vec))
return 0;
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ struct cpumask *cpupri_mask = lowest_mask;
+
+	/* a CPU is available; build the combined mask if requested */
+ if (lowest_mask) {
+ cpumask_copy(cpupri_mask, vec->masks[0]);
+ for (int nid = 1; nid < nr_node_ids; nid++)
+ cpumask_and(cpupri_mask, cpupri_mask, vec->masks[nid]);
+ }
+#else
+ struct cpumask *cpupri_mask = vec->mask;
+#endif
+
if (lowest_mask) {
- cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+ cpumask_and(lowest_mask, &p->cpus_mask, cpupri_mask);
cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
/*
@@ -229,7 +379,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
if (likely(newpri != CPUPRI_INVALID)) {
struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ cpumask_set_cpu(cpu, vec->masks[cpu_to_node(cpu)]);
+#else
cpumask_set_cpu(cpu, vec->mask);
+#endif
/*
* When adding a new vector, we update the mask first,
* do a write memory barrier, and then update the count, to
@@ -263,7 +417,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
*/
atomic_dec(&(vec)->count);
smp_mb__after_atomic();
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ cpumask_clear_cpu(cpu, vec->masks[cpu_to_node(cpu)]);
+#else
cpumask_clear_cpu(cpu, vec->mask);
+#endif
}
*currpri = newpri;
@@ -279,26 +437,31 @@ int cpupri_init(struct cpupri *cp)
{
int i;
- for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
- struct cpupri_vec *vec = &cp->pri_to_cpu[i];
-
- atomic_set(&vec->count, 0);
- if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
- goto cleanup;
- }
-
+ /* Allocate the cpu_to_pri array */
cp->cpu_to_pri = kcalloc(nr_cpu_ids, sizeof(int), GFP_KERNEL);
if (!cp->cpu_to_pri)
- goto cleanup;
+ return -ENOMEM;
+ /* Initialize all CPUs to invalid priority */
for_each_possible_cpu(i)
cp->cpu_to_pri[i] = CPUPRI_INVALID;
+ /* Setup priority vectors */
+ setup_vec_counts(cp);
+ if (setup_vec_mask_var_ts(cp))
+ goto fail_setup_vectors;
+
+ /* Allocate masks for each priority vector */
+ if (alloc_all_masks(cp))
+ goto fail_alloc_masks;
+
return 0;
-cleanup:
- for (i--; i >= 0; i--)
- free_cpumask_var(cp->pri_to_cpu[i].mask);
+fail_alloc_masks:
+ free_vec_mask_var_ts(cp);
+
+fail_setup_vectors:
+ kfree(cp->cpu_to_pri);
return -ENOMEM;
}
@@ -308,9 +471,10 @@ int cpupri_init(struct cpupri *cp)
*/
void cpupri_cleanup(struct cpupri *cp)
{
- int i;
-
kfree(cp->cpu_to_pri);
- for (i = 0; i < CPUPRI_NR_PRIORITIES; i++)
- free_cpumask_var(cp->pri_to_cpu[i].mask);
+
+ for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++)
+ free_vec_masks(&cp->pri_to_cpu[i]);
+
+ free_vec_mask_var_ts(cp);
}
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index 245b0fa626be..c53f1f4dad86 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,11 @@
struct cpupri_vec {
atomic_t count;
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ cpumask_var_t *masks ____cacheline_aligned;
+#else
cpumask_var_t mask ____cacheline_aligned;
+#endif
};
struct cpupri {
--
2.43.5
* Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
@ 2025-07-07 6:53 ` kernel test robot
2025-07-07 11:36 ` Deng, Pan
2025-07-07 6:53 ` kernel test robot
2025-07-08 5:33 ` kernel test robot
2 siblings, 1 reply; 16+ messages in thread
From: kernel test robot @ 2025-07-07 6:53 UTC (permalink / raw)
To: Pan Deng, mingo
Cc: llvm, oe-kbuild-all, linux-kernel, tianyou.li, tim.c.chen,
yu.c.chen, pan.deng
Hi Pan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on v6.16-rc5]
[also build test WARNING on linus/master]
[cannot apply to tip/sched/core peterz-queue/sched/core tip/master tip/auto-latest next-20250704]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-131831
base: v6.16-rc5
patch link: https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.1751852370.git.pan.deng%40intel.com
patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20250707/202507071418.sFa0bilv-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 01c97b4953e87ae455bd4c41e3de3f0f0f29c61c)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250707/202507071418.sFa0bilv-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507071418.sFa0bilv-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from kernel/sched/build_policy.c:52:
kernel/sched/rt.c:496:21: error: unknown type name 'atomic_tp'; did you mean 'atomic_t'?
496 | int rto_counts_init(atomic_tp **rto_counts)
| ^~~~~~~~~
| atomic_t
include/linux/types.h:183:3: note: 'atomic_t' declared here
183 | } atomic_t;
| ^
In file included from kernel/sched/build_policy.c:52:
>> kernel/sched/rt.c:496:5: warning: no previous prototype for function 'rto_counts_init' [-Wmissing-prototypes]
496 | int rto_counts_init(atomic_tp **rto_counts)
| ^
kernel/sched/rt.c:496:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
496 | int rto_counts_init(atomic_tp **rto_counts)
| ^
| static
kernel/sched/rt.c:501:25: error: unknown type name 'atomic_tp'; did you mean 'atomic_t'?
501 | void rto_counts_cleanup(atomic_tp *rto_counts)
| ^~~~~~~~~
| atomic_t
include/linux/types.h:183:3: note: 'atomic_t' declared here
183 | } atomic_t;
| ^
In file included from kernel/sched/build_policy.c:52:
>> kernel/sched/rt.c:501:6: warning: no previous prototype for function 'rto_counts_cleanup' [-Wmissing-prototypes]
501 | void rto_counts_cleanup(atomic_tp *rto_counts)
| ^
kernel/sched/rt.c:501:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
501 | void rto_counts_cleanup(atomic_tp *rto_counts)
| ^
| static
2 warnings and 2 errors generated.
vim +/rto_counts_init +496 kernel/sched/rt.c
495
> 496 int rto_counts_init(atomic_tp **rto_counts)
497 {
498 return 0;
499 }
500
> 501 void rto_counts_cleanup(atomic_tp *rto_counts)
502 {
503 }
504
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2025-07-07 6:53 ` kernel test robot
@ 2025-07-07 6:53 ` kernel test robot
2025-07-08 5:33 ` kernel test robot
2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2025-07-07 6:53 UTC (permalink / raw)
To: Pan Deng, peterz, mingo
Cc: oe-kbuild-all, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen,
pan.deng
Hi Pan,
kernel test robot noticed the following build errors:
[auto build test ERROR on v6.16-rc5]
[also build test ERROR on linus/master]
[cannot apply to tip/sched/core peterz-queue/sched/core tip/master tip/auto-latest next-20250704]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-131831
base: v6.16-rc5
patch link: https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.1751852370.git.pan.deng%40intel.com
patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
config: arm-randconfig-002-20250707 (https://download.01.org/0day-ci/archive/20250707/202507071453.DYRB711b-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250707/202507071453.DYRB711b-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507071453.DYRB711b-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/sched/build_policy.c:52:
>> kernel/sched/rt.c:496:21: error: unknown type name 'atomic_tp'; did you mean 'atomic_t'?
496 | int rto_counts_init(atomic_tp **rto_counts)
| ^~~~~~~~~
| atomic_t
kernel/sched/rt.c:501:25: error: unknown type name 'atomic_tp'; did you mean 'atomic_t'?
501 | void rto_counts_cleanup(atomic_tp *rto_counts)
| ^~~~~~~~~
| atomic_t
vim +496 kernel/sched/rt.c
495
> 496 int rto_counts_init(atomic_tp **rto_counts)
497 {
498 return 0;
499 }
500
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* RE: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 6:53 ` kernel test robot
@ 2025-07-07 11:36 ` Deng, Pan
0 siblings, 0 replies; 16+ messages in thread
From: Deng, Pan @ 2025-07-07 11:36 UTC (permalink / raw)
To: lkp, mingo@kernel.org
Cc: llvm@lists.linux.dev, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, Li, Tianyou,
tim.c.chen@linux.intel.com, Chen, Yu C
The issue arises from redundant function definitions when CONFIG_SMP is disabled; it will be addressed, along with the other feedback, in the next version.
Best Regards
Pan
> -----Original Message-----
> From: lkp <lkp@intel.com>
> Sent: Monday, July 7, 2025 2:53 PM
> To: Deng, Pan <pan.deng@intel.com>; mingo@kernel.org
> Cc: llvm@lists.linux.dev; oe-kbuild-all@lists.linux.dev; linux-
> kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com; Chen, Yu C <yu.c.chen@intel.com>; Deng, Pan
> <pan.deng@intel.com>
> Subject: Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-
> node counters
>
> Hi Pan,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on v6.16-rc5]
> [also build test WARNING on linus/master] [cannot apply to tip/sched/core
> peterz-queue/sched/core tip/master tip/auto-latest next-20250704] [If your
> patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-
> Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-
> 131831
> base: v6.16-rc5
> patch link:
> https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.1751
> 852370.git.pan.deng%40intel.com
> patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-
> NUMA-node counters
> config: arm-allnoconfig (https://download.01.org/0day-
> ci/archive/20250707/202507071418.sFa0bilv-lkp@intel.com/config)
> compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project
> 01c97b4953e87ae455bd4c41e3de3f0f0f29c61c)
> reproduce (this is a W=1 build): (https://download.01.org/0day-
> ci/archive/20250707/202507071418.sFa0bilv-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of the
> same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes:
> | https://lore.kernel.org/oe-kbuild-all/202507071418.sFa0bilv-lkp@intel.
> | com/
>
> All warnings (new ones prefixed by >>):
>
> In file included from kernel/sched/build_policy.c:52:
> kernel/sched/rt.c:496:21: error: unknown type name 'atomic_tp'; did you
> mean 'atomic_t'?
> 496 | int rto_counts_init(atomic_tp **rto_counts)
> | ^~~~~~~~~
> | atomic_t
> include/linux/types.h:183:3: note: 'atomic_t' declared here
> 183 | } atomic_t;
> | ^
> In file included from kernel/sched/build_policy.c:52:
> >> kernel/sched/rt.c:496:5: warning: no previous prototype for function
> >> 'rto_counts_init' [-Wmissing-prototypes]
> 496 | int rto_counts_init(atomic_tp **rto_counts)
> | ^
> kernel/sched/rt.c:496:1: note: declare 'static' if the function is not intended to
> be used outside of this translation unit
> 496 | int rto_counts_init(atomic_tp **rto_counts)
> | ^
> | static
> kernel/sched/rt.c:501:25: error: unknown type name 'atomic_tp'; did you
> mean 'atomic_t'?
> 501 | void rto_counts_cleanup(atomic_tp *rto_counts)
> | ^~~~~~~~~
> | atomic_t
> include/linux/types.h:183:3: note: 'atomic_t' declared here
> 183 | } atomic_t;
> | ^
> In file included from kernel/sched/build_policy.c:52:
> >> kernel/sched/rt.c:501:6: warning: no previous prototype for function
> >> 'rto_counts_cleanup' [-Wmissing-prototypes]
> 501 | void rto_counts_cleanup(atomic_tp *rto_counts)
> | ^
> kernel/sched/rt.c:501:1: note: declare 'static' if the function is not intended to
> be used outside of this translation unit
> 501 | void rto_counts_cleanup(atomic_tp *rto_counts)
> | ^
> | static
> 2 warnings and 2 errors generated.
>
>
> vim +/rto_counts_init +496 kernel/sched/rt.c
>
> 495
> > 496 int rto_counts_init(atomic_tp **rto_counts)
> 497 {
> 498 return 0;
> 499 }
> 500
> > 501 void rto_counts_cleanup(atomic_tp *rto_counts)
> 502 {
> 503 }
> 504
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2025-07-07 6:53 ` kernel test robot
2025-07-07 6:53 ` kernel test robot
@ 2025-07-08 5:33 ` kernel test robot
2025-07-08 14:02 ` Deng, Pan
2 siblings, 1 reply; 16+ messages in thread
From: kernel test robot @ 2025-07-08 5:33 UTC (permalink / raw)
To: Pan Deng, peterz, mingo
Cc: oe-kbuild-all, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen,
pan.deng
Hi Pan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on v6.16-rc5]
[also build test WARNING on linus/master]
[cannot apply to tip/sched/core peterz-queue/sched/core tip/master tip/auto-latest next-20250704]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-131831
base: v6.16-rc5
patch link: https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.1751852370.git.pan.deng%40intel.com
patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
config: loongarch-randconfig-r112-20250708 (https://download.01.org/0day-ci/archive/20250708/202507081317.4IdE2euZ-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 15.1.0
reproduce: (https://download.01.org/0day-ci/archive/20250708/202507081317.4IdE2euZ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507081317.4IdE2euZ-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
kernel/sched/rt.c:1679:45: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/rt.c:1679:45: sparse: expected struct task_struct *p
kernel/sched/rt.c:1679:45: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/rt.c:1722:39: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *donor @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/rt.c:1722:39: sparse: expected struct task_struct *donor
kernel/sched/rt.c:1722:39: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/rt.c:1742:64: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/rt.c:1742:64: sparse: expected struct task_struct *tsk
kernel/sched/rt.c:1742:64: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/rt.c:2084:40: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *task @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/rt.c:2084:40: sparse: expected struct task_struct *task
kernel/sched/rt.c:2084:40: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/rt.c:2107:13: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/rt.c:2107:13: sparse: struct task_struct *
kernel/sched/rt.c:2107:13: sparse: struct task_struct [noderef] __rcu *
kernel/sched/rt.c:2453:54: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/rt.c:2453:54: sparse: expected struct task_struct *tsk
kernel/sched/rt.c:2453:54: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/rt.c:2455:40: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/rt.c:2455:40: sparse: expected struct task_struct *p
kernel/sched/rt.c:2455:40: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/rt.c:2455:62: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/rt.c:2455:62: sparse: expected struct task_struct *p
kernel/sched/rt.c:2455:62: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/build_policy.c: note: in included file:
kernel/sched/deadline.c:2717:23: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:2717:23: sparse: expected struct task_struct *p
kernel/sched/deadline.c:2717:23: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:2727:13: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/deadline.c:2727:13: sparse: struct task_struct *
kernel/sched/deadline.c:2727:13: sparse: struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:2833:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/deadline.c:2833:25: sparse: struct task_struct *
kernel/sched/deadline.c:2833:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:2357:42: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct sched_dl_entity const *b @@ got struct sched_dl_entity [noderef] __rcu * @@
kernel/sched/deadline.c:2357:42: sparse: expected struct sched_dl_entity const *b
kernel/sched/deadline.c:2357:42: sparse: got struct sched_dl_entity [noderef] __rcu *
kernel/sched/deadline.c:2368:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/deadline.c:2368:38: sparse: expected struct task_struct *tsk
kernel/sched/deadline.c:2368:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/deadline.c:1262:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/deadline.c:1262:39: sparse: expected struct task_struct *p
kernel/sched/deadline.c:1262:39: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/deadline.c:1262:85: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct sched_dl_entity const *b @@ got struct sched_dl_entity [noderef] __rcu * @@
kernel/sched/deadline.c:1262:85: sparse: expected struct sched_dl_entity const *b
kernel/sched/deadline.c:1262:85: sparse: got struct sched_dl_entity [noderef] __rcu *
kernel/sched/deadline.c:1362:23: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:1362:23: sparse: expected struct task_struct *p
kernel/sched/deadline.c:1362:23: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:1671:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/deadline.c:1671:31: sparse: expected struct task_struct *p
kernel/sched/deadline.c:1671:31: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/deadline.c:1671:70: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct sched_dl_entity const *b @@ got struct sched_dl_entity [noderef] __rcu * @@
kernel/sched/deadline.c:1671:70: sparse: expected struct sched_dl_entity const *b
kernel/sched/deadline.c:1671:70: sparse: got struct sched_dl_entity [noderef] __rcu *
kernel/sched/deadline.c:1760:39: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *donor @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:1760:39: sparse: expected struct task_struct *donor
kernel/sched/deadline.c:1760:39: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:2578:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/deadline.c:2578:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/deadline.c:2578:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/deadline.c:2242:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu * @@
kernel/sched/deadline.c:2242:14: sparse: expected struct task_struct *curr
kernel/sched/deadline.c:2242:14: sparse: got struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:2243:15: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *donor @@ got struct task_struct [noderef] __rcu * @@
kernel/sched/deadline.c:2243:15: sparse: expected struct task_struct *donor
kernel/sched/deadline.c:2243:15: sparse: got struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:2318:43: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:2318:43: sparse: expected struct task_struct *p
kernel/sched/deadline.c:2318:43: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:2878:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/deadline.c:2878:38: sparse: expected struct task_struct *tsk
kernel/sched/deadline.c:2878:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/deadline.c:2880:23: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:2880:23: sparse: expected struct task_struct *p
kernel/sched/deadline.c:2880:23: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:2882:44: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct sched_dl_entity const *b @@ got struct sched_dl_entity [noderef] __rcu * @@
kernel/sched/deadline.c:2882:44: sparse: expected struct sched_dl_entity const *b
kernel/sched/deadline.c:2882:44: sparse: got struct sched_dl_entity [noderef] __rcu *
kernel/sched/deadline.c:3071:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/deadline.c:3071:23: sparse: struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:3071:23: sparse: struct task_struct *
kernel/sched/deadline.c:3120:32: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/build_policy.c: note: in included file:
kernel/sched/syscalls.c:206:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/syscalls.c:206:22: sparse: struct task_struct [noderef] __rcu *
kernel/sched/syscalls.c:206:22: sparse: struct task_struct *
kernel/sched/build_policy.c: note: in included file:
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/build_policy.c: note: in included file:
kernel/sched/rt.c:2413:45: sparse: sparse: dereference of noderef expression
kernel/sched/build_policy.c: note: in included file:
>> kernel/sched/sched.h:2627:35: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/build_policy.c: note: in included file:
kernel/sched/rt.c:2456:32: sparse: sparse: dereference of noderef expression
kernel/sched/rt.c:2457:32: sparse: sparse: dereference of noderef expression
kernel/sched/build_policy.c: note: in included file:
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/sched.h:2476:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2476:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2476:9: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/sched.h:2476:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2476:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2476:9: sparse: struct task_struct *
kernel/sched/build_policy.c: note: in included file:
kernel/sched/syscalls.c:1296:6: sparse: sparse: context imbalance in 'sched_getaffinity' - different lock contexts for basic block
kernel/sched/build_policy.c: note: in included file:
kernel/sched/rt.c:1767:15: sparse: sparse: dereference of noderef expression
vim +2627 kernel/sched/sched.h
04746ed80bcf31 Ingo Molnar 2024-04-07 2624
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2625 static inline struct task_struct *get_push_task(struct rq *rq)
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2626 {
af0c8b2bf67b25 Peter Zijlstra 2024-10-09 @2627 struct task_struct *p = rq->donor;
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2628
5cb9eaa3d274f7 Peter Zijlstra 2020-11-17 2629 lockdep_assert_rq_held(rq);
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2630
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2631 if (rq->push_busy)
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2632 return NULL;
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2633
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2634 if (p->nr_cpus_allowed == 1)
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2635 return NULL;
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2636
e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2637 if (p->migration_disabled)
e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2638 return NULL;
e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2639
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2640 rq->push_busy = true;
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2641 return get_task_struct(p);
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2642 }
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2643
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-08 5:33 ` kernel test robot
@ 2025-07-08 14:02 ` Deng, Pan
2025-07-09 8:56 ` Li, Philip
0 siblings, 1 reply; 16+ messages in thread
From: Deng, Pan @ 2025-07-08 14:02 UTC (permalink / raw)
To: lkp, peterz@infradead.org, mingo@kernel.org
Cc: oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
Li, Tianyou, tim.c.chen@linux.intel.com, Chen, Yu C
> -----Original Message-----
> From: lkp <lkp@intel.com>
> Sent: Tuesday, July 8, 2025 1:34 PM
> To: Deng, Pan <pan.deng@intel.com>; peterz@infradead.org; mingo@kernel.org
> Cc: oe-kbuild-all@lists.linux.dev; linux-kernel@vger.kernel.org; Li, Tianyou
> <tianyou.li@intel.com>; tim.c.chen@linux.intel.com; Chen, Yu C
> <yu.c.chen@intel.com>; Deng, Pan <pan.deng@intel.com>
> Subject: Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-
> node counters
>
> Hi Pan,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on v6.16-rc5]
> [also build test WARNING on linus/master] [cannot apply to tip/sched/core
> peterz-queue/sched/core tip/master tip/auto-latest next-20250704] [If your
> patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-
> Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-131831
> base: v6.16-rc5
> patch link:
> https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.175185
> 2370.git.pan.deng%40intel.com
> patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-
> node counters
> config: loongarch-randconfig-r112-20250708 (https://download.01.org/0day-
> ci/archive/20250708/202507081317.4IdE2euZ-lkp@intel.com/config)
> compiler: loongarch64-linux-gcc (GCC) 15.1.0
> reproduce: (https://download.01.org/0day-
> ci/archive/20250708/202507081317.4IdE2euZ-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of the
> same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes:
> | https://lore.kernel.org/oe-kbuild-all/202507081317.4IdE2euZ-lkp@intel.
> | com/
>
> sparse warnings: (new ones prefixed by >>)
> kernel/sched/rt.c:1679:45: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/rt.c:1679:45: sparse: expected struct task_struct *p
> kernel/sched/rt.c:1679:45: sparse: got struct task_struct [noderef] __rcu
> *donor
> kernel/sched/rt.c:1722:39: sparse: sparse: incorrect type in initializer (different
> address spaces) @@ expected struct task_struct *donor @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/rt.c:1722:39: sparse: expected struct task_struct *donor
> kernel/sched/rt.c:1722:39: sparse: got struct task_struct [noderef] __rcu
> *donor
> kernel/sched/rt.c:1742:64: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *tsk @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/rt.c:1742:64: sparse: expected struct task_struct *tsk
> kernel/sched/rt.c:1742:64: sparse: got struct task_struct [noderef] __rcu
> *curr
> kernel/sched/rt.c:2084:40: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *task @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/rt.c:2084:40: sparse: expected struct task_struct *task
> kernel/sched/rt.c:2084:40: sparse: got struct task_struct [noderef] __rcu
> *curr
> kernel/sched/rt.c:2107:13: sparse: sparse: incompatible types in comparison
> expression (different address spaces):
> kernel/sched/rt.c:2107:13: sparse: struct task_struct *
> kernel/sched/rt.c:2107:13: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/rt.c:2453:54: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *tsk @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/rt.c:2453:54: sparse: expected struct task_struct *tsk
> kernel/sched/rt.c:2453:54: sparse: got struct task_struct [noderef] __rcu
> *curr
> kernel/sched/rt.c:2455:40: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/rt.c:2455:40: sparse: expected struct task_struct *p
> kernel/sched/rt.c:2455:40: sparse: got struct task_struct [noderef] __rcu
> *donor
> kernel/sched/rt.c:2455:62: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/rt.c:2455:62: sparse: expected struct task_struct *p
> kernel/sched/rt.c:2455:62: sparse: got struct task_struct [noderef] __rcu
> *donor
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/deadline.c:2717:23: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:2717:23: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:2717:23: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:2727:13: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/deadline.c:2727:13: sparse: struct task_struct *
> kernel/sched/deadline.c:2727:13: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/deadline.c:2833:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/deadline.c:2833:25: sparse: struct task_struct *
> kernel/sched/deadline.c:2833:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/deadline.c:2357:42: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct sched_dl_entity const *b @@
> got struct sched_dl_entity [noderef] __rcu * @@
> kernel/sched/deadline.c:2357:42: sparse: expected struct sched_dl_entity
> const *b
> kernel/sched/deadline.c:2357:42: sparse: got struct sched_dl_entity
> [noderef] __rcu *
> kernel/sched/deadline.c:2368:38: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *tsk @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/deadline.c:2368:38: sparse: expected struct task_struct *tsk
> kernel/sched/deadline.c:2368:38: sparse: got struct task_struct [noderef]
> __rcu *curr
> kernel/sched/deadline.c:1262:39: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *curr @@
> kernel/sched/deadline.c:1262:39: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:1262:39: sparse: got struct task_struct [noderef]
> __rcu *curr
> kernel/sched/deadline.c:1262:85: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct sched_dl_entity const *b @@
> got struct sched_dl_entity [noderef] __rcu * @@
> kernel/sched/deadline.c:1262:85: sparse: expected struct sched_dl_entity
> const *b
> kernel/sched/deadline.c:1262:85: sparse: got struct sched_dl_entity
> [noderef] __rcu *
> kernel/sched/deadline.c:1362:23: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:1362:23: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:1362:23: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:1671:31: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *curr @@
> kernel/sched/deadline.c:1671:31: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:1671:31: sparse: got struct task_struct [noderef]
> __rcu *curr
> kernel/sched/deadline.c:1671:70: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct sched_dl_entity const *b @@
> got struct sched_dl_entity [noderef] __rcu * @@
> kernel/sched/deadline.c:1671:70: sparse: expected struct sched_dl_entity
> const *b
> kernel/sched/deadline.c:1671:70: sparse: got struct sched_dl_entity
> [noderef] __rcu *
> kernel/sched/deadline.c:1760:39: sparse: sparse: incorrect type in initializer
> (different address spaces) @@ expected struct task_struct *donor @@ got
> struct task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:1760:39: sparse: expected struct task_struct *donor
> kernel/sched/deadline.c:1760:39: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:2578:9: sparse: sparse: incorrect type in assignment
> (different address spaces) @@ expected struct sched_domain *[assigned] sd
> @@ got struct sched_domain [noderef] __rcu *parent @@
> kernel/sched/deadline.c:2578:9: sparse: expected struct sched_domain
> *[assigned] sd
> kernel/sched/deadline.c:2578:9: sparse: got struct sched_domain [noderef]
> __rcu *parent
> kernel/sched/deadline.c:2242:14: sparse: sparse: incorrect type in assignment
> (different address spaces) @@ expected struct task_struct *curr @@ got
> struct task_struct [noderef] __rcu * @@
> kernel/sched/deadline.c:2242:14: sparse: expected struct task_struct *curr
> kernel/sched/deadline.c:2242:14: sparse: got struct task_struct [noderef]
> __rcu *
> kernel/sched/deadline.c:2243:15: sparse: sparse: incorrect type in assignment
> (different address spaces) @@ expected struct task_struct *donor @@ got
> struct task_struct [noderef] __rcu * @@
> kernel/sched/deadline.c:2243:15: sparse: expected struct task_struct *donor
> kernel/sched/deadline.c:2243:15: sparse: got struct task_struct [noderef]
> __rcu *
> kernel/sched/deadline.c:2318:43: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:2318:43: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:2318:43: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:2878:38: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *tsk @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/deadline.c:2878:38: sparse: expected struct task_struct *tsk
> kernel/sched/deadline.c:2878:38: sparse: got struct task_struct [noderef]
> __rcu *curr
> kernel/sched/deadline.c:2880:23: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:2880:23: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:2880:23: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:2882:44: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct sched_dl_entity const *b @@
> got struct sched_dl_entity [noderef] __rcu * @@
> kernel/sched/deadline.c:2882:44: sparse: expected struct sched_dl_entity
> const *b
> kernel/sched/deadline.c:2882:44: sparse: got struct sched_dl_entity
> [noderef] __rcu *
> kernel/sched/deadline.c:3071:23: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/deadline.c:3071:23: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/deadline.c:3071:23: sparse: struct task_struct *
> kernel/sched/deadline.c:3120:32: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *curr @@
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/syscalls.c:206:22: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/syscalls.c:206:22: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/syscalls.c:206:22: sparse: struct task_struct *
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/rt.c:2413:45: sparse: sparse: dereference of noderef expression
> kernel/sched/build_policy.c: note: in included file:
> >> kernel/sched/sched.h:2627:35: sparse: sparse: incorrect type in initializer
This warning does not appear to be related to the change we made. @lkp, could you please check it?
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/rt.c:2456:32: sparse: sparse: dereference of noderef expression
> kernel/sched/rt.c:2457:32: sparse: sparse: dereference of noderef expression
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/sched.h:2476:9: sparse: sparse: incompatible types in comparison
> expression (different address spaces):
> kernel/sched/sched.h:2476:9: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2476:9: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/sched.h:2476:9: sparse: sparse: incompatible types in comparison
> expression (different address spaces):
> kernel/sched/sched.h:2476:9: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2476:9: sparse: struct task_struct *
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/syscalls.c:1296:6: sparse: sparse: context imbalance in
> 'sched_getaffinity' - different lock contexts for basic block
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/rt.c:1767:15: sparse: sparse: dereference of noderef expression
>
> vim +2627 kernel/sched/sched.h
>
> 04746ed80bcf31 Ingo Molnar 2024-04-07 2624
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2625 static inline struct
> task_struct *get_push_task(struct rq *rq)
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2626 {
> af0c8b2bf67b25 Peter Zijlstra 2024-10-09 @2627 struct task_struct *p =
> rq->donor;
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2628
> 5cb9eaa3d274f7 Peter Zijlstra 2020-11-17 2629
> lockdep_assert_rq_held(rq);
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2630
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2631 if (rq->push_busy)
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2632 return NULL;
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2633
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2634 if (p->nr_cpus_allowed
> == 1)
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2635 return NULL;
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2636
> e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2637 if (p-
> >migration_disabled)
> e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2638 return
> NULL;
> e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2639
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2640 rq->push_busy = true;
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2641 return
> get_task_struct(p);
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2642 }
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2643
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
* RE: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-08 14:02 ` Deng, Pan
@ 2025-07-09 8:56 ` Li, Philip
0 siblings, 0 replies; 16+ messages in thread
From: Li, Philip @ 2025-07-09 8:56 UTC (permalink / raw)
To: Deng, Pan, lkp, peterz@infradead.org, mingo@kernel.org
Cc: oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
Li, Tianyou, tim.c.chen@linux.intel.com, Chen, Yu C
> > comparison expression (different address spaces):
> > kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> > kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> > kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> > comparison expression (different address spaces):
> > kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> > kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> > kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> > comparison expression (different address spaces):
> > kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> > kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> > kernel/sched/build_policy.c: note: in included file:
> > kernel/sched/rt.c:2413:45: sparse: sparse: dereference of noderef expression
> > kernel/sched/build_policy.c: note: in included file:
> > >> kernel/sched/sched.h:2627:35: sparse: sparse: incorrect type in initializer
> This warning is not about the change we made, @lkp, could you please check it?
Sorry for this false report; it should not be related to your changes. We will follow
up to figure out what went wrong during the bisection. Sorry for wasting your time.
* Re: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
2025-07-07 2:35 ` [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
@ 2025-07-21 11:23 ` Chen, Yu C
2025-07-22 14:46 ` Deng, Pan
0 siblings, 1 reply; 16+ messages in thread
From: Chen, Yu C @ 2025-07-21 11:23 UTC (permalink / raw)
To: Pan Deng; +Cc: linux-kernel, tianyou.li, tim.c.chen, peterz, mingo
On 7/7/2025 10:35 AM, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on HCC system, significant
> contention is observed on bitmap of `cpupri_vec->cpumask`.
>
> The SUT is a 2-socket machine with 240 physical cores and 480 logical
> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS is used as score.
>
> perf c2c tool reveals:
> cpumask (bitmap) cache line of `cpupri_vec->mask`:
> - bits are loaded during cpupri_find
> - bits are stored during cpupri_set
> - cycles per load: ~2.2K to 8.7K
>
> This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> mitigate false sharing.
>
> As a result:
> - FPS improves by ~3.8%
> - Kernel cycles% drops from ~20% to ~18.7%
> - Cache line contention is mitigated, perf-c2c shows cycles per load
> drops from ~2.2K-8.7K to ~0.5K-2.2K
>
This brings a noticeable improvement for the RT workload, and it would
be even more convincing if we could also try a normal task workload,
at least to confirm it brings no regressions (schbench/hackbench, etc).
thanks,
Chenyu
> Note: CONFIG_CPUMASK_OFFSTACK=n remains unchanged.
>
* RE: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
2025-07-21 11:23 ` Chen, Yu C
@ 2025-07-22 14:46 ` Deng, Pan
2025-08-06 14:00 ` Deng, Pan
0 siblings, 1 reply; 16+ messages in thread
From: Deng, Pan @ 2025-07-22 14:46 UTC (permalink / raw)
To: Chen, Yu C
Cc: linux-kernel@vger.kernel.org, Li, Tianyou,
tim.c.chen@linux.intel.com, peterz@infradead.org,
mingo@kernel.org
> -----Original Message-----
> From: Chen, Yu C <yu.c.chen@intel.com>
> Sent: Monday, July 21, 2025 7:24 PM
> To: Deng, Pan <pan.deng@intel.com>
> Cc: linux-kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com; peterz@infradead.org; mingo@kernel.org
> Subject: Re: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA
> node to reduce contention
>
> On 7/7/2025 10:35 AM, Pan Deng wrote:
> > When running a multi-instance FFmpeg workload on HCC system,
> > significant contention is observed on bitmap of `cpupri_vec->cpumask`.
> >
> > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical
> > cores
> > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > with FIFO scheduling. FPS is used as score.
> >
> > perf c2c tool reveals:
> > cpumask (bitmap) cache line of `cpupri_vec->mask`:
> > - bits are loaded during cpupri_find
> > - bits are stored during cpupri_set
> > - cycles per load: ~2.2K to 8.7K
> >
> > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> > mitigate false sharing.
> >
> > As a result:
> > - FPS improves by ~3.8%
> > - Kernel cycles% drops from ~20% to ~18.7%
> > - Cache line contention is mitigated, perf-c2c shows cycles per load
> > drops from ~2.2K-8.7K to ~0.5K-2.2K
> >
>
> This brings noticeable improvement for RT workload, and it would be even
> more convincing if we can have try on normal task workload, at least not bring
> regression(schbench/hackbench, etc).
>
Thanks Yu, hackbench and schbench data will be provided later.
> thanks,
> Chenyu
>
> > Note: CONFIG_CPUMASK_OFFSTACK=n remains unchanged.
> >
>
* RE: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
2025-07-22 14:46 ` Deng, Pan
@ 2025-08-06 14:00 ` Deng, Pan
0 siblings, 0 replies; 16+ messages in thread
From: Deng, Pan @ 2025-08-06 14:00 UTC (permalink / raw)
To: Chen, Yu C
Cc: linux-kernel@vger.kernel.org, Li, Tianyou,
tim.c.chen@linux.intel.com, peterz@infradead.org,
mingo@kernel.org
> -----Original Message-----
> From: Deng, Pan
> Sent: Tuesday, July 22, 2025 10:47 PM
> To: Chen, Yu C <yu.c.chen@intel.com>
> Cc: linux-kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com; peterz@infradead.org; mingo@kernel.org
> Subject: RE: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA
> node to reduce contention
>
>
> > -----Original Message-----
> > From: Chen, Yu C <yu.c.chen@intel.com>
> > Sent: Monday, July 21, 2025 7:24 PM
> > To: Deng, Pan <pan.deng@intel.com>
> > Cc: linux-kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> > tim.c.chen@linux.intel.com; peterz@infradead.org; mingo@kernel.org
> > Subject: Re: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA
> > node to reduce contention
> >
> > On 7/7/2025 10:35 AM, Pan Deng wrote:
> > > When running a multi-instance FFmpeg workload on HCC system,
> > > significant contention is observed on bitmap of `cpupri_vec->cpumask`.
> > >
> > > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> > > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical
> > > cores
> > > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > > with FIFO scheduling. FPS is used as score.
> > >
> > > perf c2c tool reveals:
> > > cpumask (bitmap) cache line of `cpupri_vec->mask`:
> > > - bits are loaded during cpupri_find
> > > - bits are stored during cpupri_set
> > > - cycles per load: ~2.2K to 8.7K
> > >
> > > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> > > mitigate false sharing.
> > >
> > > As a result:
> > > - FPS improves by ~3.8%
> > > - Kernel cycles% drops from ~20% to ~18.7%
> > > - Cache line contention is mitigated, perf-c2c shows cycles per load
> > > drops from ~2.2K-8.7K to ~0.5K-2.2K
> > >
> >
> > This brings noticeable improvement for RT workload, and it would be even
> > more convincing if we can have try on normal task workload, at least not
> bring
> > regression(schbench/hackbench, etc).
> >
>
> Thanks Yu, hackbench and schbench data will be provided later.
>
>
TLDR;
====
Hackbench and both the old and new versions of schbench were evaluated on a
SUT with 2 sockets/6 NUMA nodes/240 physical cores/480 logical CPUs. No
regressions were detected for patches 1-4. In addition, symbol-level analysis
of `perf record -a` profiling data indicates that the changes introduced
in patches 1-4 are unlikely to cause regressions in hackbench or schbench.
Details
=======
Hackbench
=========
The workloads were run by the test framework
https://github.com/yu-chen-surf/schedtests, with the following procedure:
1. Reboot the system to run a workload.
2. Run 5 iterations of the 1st configuration with a 30s cool-down period.
3. Run 5 iterations of the 2nd configuration.
...
The test results are as follows; regressions exceeding -10% are marked
with ** at the end of the line. However, when the tests were re-run using
the test framework or a vanilla workload, the regressions could not be
reproduced.
Note: 15/30/45 are the number of fds, i.e. process/thread pairs, in one group.
Patch 1
case load baseline(std%) patch1%( std%)
process-pipe-15 1-groups 1.00 ( 14.03) -8.81 ( 6.53)
process-pipe-15 2-groups 1.00 ( 3.46) +1.82 ( 2.59)
process-pipe-15 4-groups 1.00 ( 6.20) +8.60 ( 5.59)
process-pipe-15 8-groups 1.00 ( 2.41) -0.21 ( 3.22)
process-pipe-30 1-groups 1.00 ( 2.51) +2.24 ( 3.12)
process-pipe-30 2-groups 1.00 ( 3.86) -0.58 ( 2.46)
process-pipe-30 4-groups 1.00 ( 2.19) -1.81 ( 1.05)
process-pipe-30 8-groups 1.00 ( 1.69) +0.52 ( 3.01)
process-pipe-45 1-groups 1.00 ( 1.63) +1.63 ( 1.23)
process-pipe-45 2-groups 1.00 ( 0.79) +0.08 ( 1.82)
process-pipe-45 4-groups 1.00 ( 1.62) -0.06 ( 0.63)
process-pipe-45 8-groups 1.00 ( 1.66) -4.12 ( 3.27)
process-sockets-15 1-groups 1.00 ( 3.57) +2.36 ( 5.15)
process-sockets-15 2-groups 1.00 ( 3.59) -1.33 ( 6.86)
process-sockets-15 4-groups 1.00 ( 7.10) +5.44 ( 6.97)
process-sockets-15 8-groups 1.00 ( 2.63) -3.05 ( 1.94)
process-sockets-30 1-groups 1.00 ( 3.73) -2.69 ( 4.89)
process-sockets-30 2-groups 1.00 ( 3.90) -4.25 ( 3.94)
process-sockets-30 4-groups 1.00 ( 1.03) -1.58 ( 1.51)
process-sockets-30 8-groups 1.00 ( 0.48) +1.09 ( 0.68)
process-sockets-45 1-groups 1.00 ( 0.62) -2.25 ( 0.57)
process-sockets-45 2-groups 1.00 ( 2.56) -0.61 ( 0.63)
process-sockets-45 4-groups 1.00 ( 0.57) -0.51 ( 0.79)
process-sockets-45 8-groups 1.00 ( 0.18) -5.23 ( 2.18)
threads-pipe-15 1-groups 1.00 ( 5.30) -1.47 ( 5.38)
threads-pipe-15 2-groups 1.00 ( 7.97) -1.31 ( 8.61)
threads-pipe-15 4-groups 1.00 ( 4.94) -3.31 ( 5.48)
threads-pipe-15 8-groups 1.00 ( 1.69) +7.28 ( 5.54)
threads-pipe-30 1-groups 1.00 ( 5.12) -1.58 ( 4.82)
threads-pipe-30 2-groups 1.00 ( 1.63) +3.29 ( 1.72)
threads-pipe-30 4-groups 1.00 ( 3.41) +3.05 ( 3.22)
threads-pipe-30 8-groups 1.00 ( 2.85) +1.58 ( 4.05)
threads-pipe-45 1-groups 1.00 ( 5.13) -0.78 ( 6.78)
threads-pipe-45 2-groups 1.00 ( 1.92) -2.87 ( 1.27)
threads-pipe-45 4-groups 1.00 ( 2.41) -4.37 ( 1.23)
threads-pipe-45 8-groups 1.00 ( 1.81) +1.85 ( 1.54)
threads-sockets-15 1-groups 1.00 ( 4.72) -0.73 ( 2.75)
threads-sockets-15 2-groups 1.00 ( 3.05) +3.09 ( 3.39)
threads-sockets-15 4-groups 1.00 ( 5.92) +0.87 ( 2.25)
threads-sockets-15 8-groups 1.00 ( 3.75) -7.24 ( 3.34)
threads-sockets-30 1-groups 1.00 ( 5.96) -6.27 ( 3.35)
threads-sockets-30 2-groups 1.00 ( 1.68) -1.78 ( 3.60)
threads-sockets-30 4-groups 1.00 ( 5.02) -0.95 ( 3.60)
threads-sockets-30 8-groups 1.00 ( 0.41) -3.09 ( 2.03)
threads-sockets-45 1-groups 1.00 ( 2.55) -1.32 ( 1.37)
threads-sockets-45 2-groups 1.00 ( 3.53) -0.46 ( 3.99)
threads-sockets-45 4-groups 1.00 ( 0.51) +0.67 ( 0.74)
threads-sockets-45 8-groups 1.00 ( 3.01) -16.85 ( 2.13) **
Patch 2
case load baseline(std%) patch2%( std%)
process-pipe-15 1-groups 1.00 ( 14.03) -3.32 ( 11.34)
process-pipe-15 2-groups 1.00 ( 3.46) +2.19 ( 7.27)
process-pipe-15 4-groups 1.00 ( 6.20) +2.01 ( 2.83)
process-pipe-15 8-groups 1.00 ( 2.41) +1.65 ( 4.39)
process-pipe-30 1-groups 1.00 ( 2.51) -0.88 ( 3.26)
process-pipe-30 2-groups 1.00 ( 3.86) +2.25 ( 3.21)
process-pipe-30 4-groups 1.00 ( 2.19) +0.20 ( 1.72)
process-pipe-30 8-groups 1.00 ( 1.69) +0.85 ( 0.61)
process-pipe-45 1-groups 1.00 ( 1.63) +3.10 ( 4.01)
process-pipe-45 2-groups 1.00 ( 0.79) -1.00 ( 0.69)
process-pipe-45 4-groups 1.00 ( 1.62) +0.07 ( 0.63)
process-pipe-45 8-groups 1.00 ( 1.66) +0.20 ( 1.47)
process-sockets-15 1-groups 1.00 ( 3.57) -5.44 ( 3.45)
process-sockets-15 2-groups 1.00 ( 3.59) +1.00 ( 4.35)
process-sockets-15 4-groups 1.00 ( 7.10) +0.46 ( 4.45)
process-sockets-15 8-groups 1.00 ( 2.63) -1.48 ( 4.56)
process-sockets-30 1-groups 1.00 ( 3.73) -0.17 ( 3.57)
process-sockets-30 2-groups 1.00 ( 3.90) +3.83 ( 7.54)
process-sockets-30 4-groups 1.00 ( 1.03) -2.35 ( 6.11)
process-sockets-30 8-groups 1.00 ( 0.48) -0.43 ( 0.79)
process-sockets-45 1-groups 1.00 ( 0.62) -2.24 ( 1.63)
process-sockets-45 2-groups 1.00 ( 2.56) -1.41 ( 3.17)
process-sockets-45 4-groups 1.00 ( 0.57) -0.29 ( 0.33)
process-sockets-45 8-groups 1.00 ( 0.18) -6.05 ( 1.55)
threads-pipe-15 1-groups 1.00 ( 5.30) -5.83 ( 7.96)
threads-pipe-15 2-groups 1.00 ( 7.97) -3.74 ( 4.22)
threads-pipe-15 4-groups 1.00 ( 4.94) -2.23 ( 5.75)
threads-pipe-15 8-groups 1.00 ( 1.69) +0.21 ( 3.08)
threads-pipe-30 1-groups 1.00 ( 5.12) -5.73 ( 4.97)
threads-pipe-30 2-groups 1.00 ( 1.63) -1.76 ( 4.49)
threads-pipe-30 4-groups 1.00 ( 3.41) -0.99 ( 2.50)
threads-pipe-30 8-groups 1.00 ( 2.85) +0.71 ( 1.04)
threads-pipe-45 1-groups 1.00 ( 5.13) +0.08 ( 5.72)
threads-pipe-45 2-groups 1.00 ( 1.92) -1.78 ( 1.30)
threads-pipe-45 4-groups 1.00 ( 2.41) -3.79 ( 0.81)
threads-pipe-45 8-groups 1.00 ( 1.81) -3.62 ( 1.41)
threads-sockets-15 1-groups 1.00 ( 4.72) +2.52 ( 2.66)
threads-sockets-15 2-groups 1.00 ( 3.05) -7.59 ( 1.80)
threads-sockets-15 4-groups 1.00 ( 5.92) +1.59 ( 7.12)
threads-sockets-15 8-groups 1.00 ( 3.75) -0.34 ( 3.62)
threads-sockets-30 1-groups 1.00 ( 5.96) -2.45 ( 4.89)
threads-sockets-30 2-groups 1.00 ( 1.68) -0.61 ( 4.80)
threads-sockets-30 4-groups 1.00 ( 5.02) -2.15 ( 8.62)
threads-sockets-30 8-groups 1.00 ( 0.41) -17.32 ( 0.88) **
threads-sockets-45 1-groups 1.00 ( 2.55) -3.24 ( 3.37)
threads-sockets-45 2-groups 1.00 ( 3.53) -1.38 ( 2.40)
threads-sockets-45 4-groups 1.00 ( 0.51) -0.17 ( 0.85)
threads-sockets-45 8-groups 1.00 ( 3.01) -14.59 ( 5.48) **
Patch 3
case load baseline(std%) patch3%( std%)
process-pipe-15 1-groups 1.00 ( 14.03) -10.18 ( 3.39) **
process-pipe-15 2-groups 1.00 ( 3.46) +5.18 ( 3.12)
process-pipe-15 4-groups 1.00 ( 6.20) +8.63 ( 5.72)
process-pipe-15 8-groups 1.00 ( 2.41) +5.37 ( 2.24)
process-pipe-30 1-groups 1.00 ( 2.51) +5.53 ( 3.55)
process-pipe-30 2-groups 1.00 ( 3.86) +5.70 ( 4.27)
process-pipe-30 4-groups 1.00 ( 2.19) +3.95 ( 3.34)
process-pipe-30 8-groups 1.00 ( 1.69) -3.38 ( 1.51)
process-pipe-45 1-groups 1.00 ( 1.63) +5.19 ( 2.51)
process-pipe-45 2-groups 1.00 ( 0.79) -0.63 ( 2.06)
process-pipe-45 4-groups 1.00 ( 1.62) -5.83 ( 2.22)
process-pipe-45 8-groups 1.00 ( 1.66) -6.13 ( 2.34)
process-sockets-15 1-groups 1.00 ( 3.57) -1.51 ( 4.21)
process-sockets-15 2-groups 1.00 ( 3.59) -1.30 ( 7.50)
process-sockets-15 4-groups 1.00 ( 7.10) -1.80 ( 5.58)
process-sockets-15 8-groups 1.00 ( 2.63) -1.68 ( 3.40)
process-sockets-30 1-groups 1.00 ( 3.73) -7.74 ( 1.58)
process-sockets-30 2-groups 1.00 ( 3.90) -1.98 ( 5.48)
process-sockets-30 4-groups 1.00 ( 1.03) -0.33 ( 3.47)
process-sockets-30 8-groups 1.00 ( 0.48) -0.40 ( 0.84)
process-sockets-45 1-groups 1.00 ( 0.62) -0.21 ( 0.54)
process-sockets-45 2-groups 1.00 ( 2.56) -1.97 ( 2.48)
process-sockets-45 4-groups 1.00 ( 0.57) -0.61 ( 0.83)
process-sockets-45 8-groups 1.00 ( 0.18) -5.09 ( 1.85)
threads-pipe-15 1-groups 1.00 ( 5.30) +3.62 ( 11.04)
threads-pipe-15 2-groups 1.00 ( 7.97) +8.08 ( 4.63)
threads-pipe-15 4-groups 1.00 ( 4.94) +6.46 ( 5.27)
threads-pipe-15 8-groups 1.00 ( 1.69) +2.68 ( 3.23)
threads-pipe-30 1-groups 1.00 ( 5.12) +3.60 ( 7.09)
threads-pipe-30 2-groups 1.00 ( 1.63) -0.80 ( 4.43)
threads-pipe-30 4-groups 1.00 ( 3.41) +2.37 ( 2.16)
threads-pipe-30 8-groups 1.00 ( 2.85) +4.17 ( 1.41)
threads-pipe-45 1-groups 1.00 ( 5.13) +7.41 ( 4.48)
threads-pipe-45 2-groups 1.00 ( 1.92) -1.40 ( 2.69)
threads-pipe-45 4-groups 1.00 ( 2.41) -1.25 ( 2.15)
threads-pipe-45 8-groups 1.00 ( 1.81) +1.62 ( 0.73)
threads-sockets-15 1-groups 1.00 ( 4.72) +10.11 ( 7.95)
threads-sockets-15 2-groups 1.00 ( 3.05) -8.41 ( 5.93)
threads-sockets-15 4-groups 1.00 ( 5.92) -10.89 ( 4.29) **
threads-sockets-15 8-groups 1.00 ( 3.75) -7.66 ( 3.33)
threads-sockets-30 1-groups 1.00 ( 5.96) -5.18 ( 2.77)
threads-sockets-30 2-groups 1.00 ( 1.68) -4.91 ( 3.89)
threads-sockets-30 4-groups 1.00 ( 5.02) -6.32 ( 4.19)
threads-sockets-30 8-groups 1.00 ( 0.41) -11.73 ( 0.96) **
threads-sockets-45 1-groups 1.00 ( 2.55) -3.16 ( 1.97)
threads-sockets-45 2-groups 1.00 ( 3.53) -0.21 ( 4.33)
threads-sockets-45 4-groups 1.00 ( 0.51) -0.75 ( 2.07)
threads-sockets-45 8-groups 1.00 ( 3.01) -20.52 ( 1.44) **
Patch 4
case load baseline(std%) patch4%( std%)
process-pipe-15 1-groups 1.00 ( 14.03) -2.68 ( 9.64)
process-pipe-15 2-groups 1.00 ( 3.46) +1.82 ( 7.55)
process-pipe-15 4-groups 1.00 ( 6.20) +3.67 ( 8.17)
process-pipe-15 8-groups 1.00 ( 2.41) +1.87 ( 0.92)
process-pipe-30 1-groups 1.00 ( 2.51) -3.34 ( 3.96)
process-pipe-30 2-groups 1.00 ( 3.86) -0.33 ( 3.53)
process-pipe-30 4-groups 1.00 ( 2.19) -3.22 ( 1.31)
process-pipe-30 8-groups 1.00 ( 1.69) -1.95 ( 1.07)
process-pipe-45 1-groups 1.00 ( 1.63) +0.63 ( 2.86)
process-pipe-45 2-groups 1.00 ( 0.79) -1.27 ( 1.39)
process-pipe-45 4-groups 1.00 ( 1.62) -2.04 ( 1.87)
process-pipe-45 8-groups 1.00 ( 1.66) -1.45 ( 3.20)
process-sockets-15 1-groups 1.00 ( 3.57) -9.16 ( 5.33)
process-sockets-15 2-groups 1.00 ( 3.59) -1.83 ( 5.36)
process-sockets-15 4-groups 1.00 ( 7.10) +7.55 ( 6.34)
process-sockets-15 8-groups 1.00 ( 2.63) -2.98 ( 5.95)
process-sockets-30 1-groups 1.00 ( 3.73) +3.50 ( 4.92)
process-sockets-30 2-groups 1.00 ( 3.90) +1.80 ( 5.68)
process-sockets-30 4-groups 1.00 ( 1.03) -1.23 ( 4.79)
process-sockets-30 8-groups 1.00 ( 0.48) -0.15 ( 0.33)
process-sockets-45 1-groups 1.00 ( 0.62) -0.70 ( 1.12)
process-sockets-45 2-groups 1.00 ( 2.56) +0.64 ( 0.86)
process-sockets-45 4-groups 1.00 ( 0.57) +0.09 ( 0.53)
process-sockets-45 8-groups 1.00 ( 0.18) -7.31 ( 2.11)
threads-pipe-15 1-groups 1.00 ( 5.30) +4.94 ( 9.52)
threads-pipe-15 2-groups 1.00 ( 7.97) -4.28 ( 2.30)
threads-pipe-15 4-groups 1.00 ( 4.94) -1.83 ( 4.24)
threads-pipe-15 8-groups 1.00 ( 1.69) -2.35 ( 1.50)
threads-pipe-30 1-groups 1.00 ( 5.12) +2.06 ( 5.00)
threads-pipe-30 2-groups 1.00 ( 1.63) +0.93 ( 4.53)
threads-pipe-30 4-groups 1.00 ( 3.41) -2.85 ( 3.20)
threads-pipe-30 8-groups 1.00 ( 2.85) -2.20 ( 2.68)
threads-pipe-45 1-groups 1.00 ( 5.13) -0.97 ( 4.70)
threads-pipe-45 2-groups 1.00 ( 1.92) -2.11 ( 1.21)
threads-pipe-45 4-groups 1.00 ( 2.41) -2.69 ( 1.33)
threads-pipe-45 8-groups 1.00 ( 1.81) -2.41 ( 1.14)
threads-sockets-15 1-groups 1.00 ( 4.72) +0.82 ( 4.21)
threads-sockets-15 2-groups 1.00 ( 3.05) -1.28 ( 2.48)
threads-sockets-15 4-groups 1.00 ( 5.92) -1.75 ( 7.25)
threads-sockets-15 8-groups 1.00 ( 3.75) -2.54 ( 3.49)
threads-sockets-30 1-groups 1.00 ( 5.96) -0.46 ( 5.30)
threads-sockets-30 2-groups 1.00 ( 1.68) -0.45 ( 1.75)
threads-sockets-30 4-groups 1.00 ( 5.02) -1.48 ( 6.51)
threads-sockets-30 8-groups 1.00 ( 0.41) -13.09 ( 1.61) **
threads-sockets-45 1-groups 1.00 ( 2.55) -1.68 ( 0.66)
threads-sockets-45 2-groups 1.00 ( 3.53) +0.21 ( 2.23)
threads-sockets-45 4-groups 1.00 ( 0.51) -1.27 ( 1.43)
threads-sockets-45 8-groups 1.00 ( 3.01) -3.41 ( 0.43)
Additionally, profiling data was collected with `perf record -a` for this
workload. First, the cycle distributions are almost identical between the
baseline and patches 1-4. Second, the symbols relevant to patches 1-4 were
set_rd_overloaded/set_rd_overutilized, which are potentially invoked
(in fact inlined) by `update_sd_lb_stats`. `update_sd_lb_stats` itself
takes ~2.6% of cycles in the baseline threads-sockets-45, 8-groups
configuration, and no regression was observed in this function with
patches 1-4. So I think the patches won't cause regressions in hackbench.
Schbench(old, 91ea787)
======================
The workload was run with the same methodology as hackbench, with a
runtime of 100s. Results are below; regressions beyond -5% are marked
with ** at the end of the line. However, when the tests were re-run with
either the test framework or the vanilla workload, these regressions
could not be reproduced.
case load baseline(std%) opt1%( std%)
normal 1-mthreads-8-workers 1.00 ( 1.44) -5.60 ( 2.96) **
normal 1-mthreads-2-workers 1.00 ( 2.79) -2.65 ( 5.48)
normal 1-mthreads-1-workers 1.00 ( 1.27) -1.60 ( 1.03)
normal 1-mthreads-31-workers 1.00 ( 1.30) -0.87 ( 2.34)
normal 1-mthreads-16-workers 1.00 ( 1.74) -2.23 ( 1.15)
normal 1-mthreads-4-workers 1.00 ( 3.35) -1.92 ( 1.62)
normal 2-mthreads-8-workers 1.00 ( 2.17) -2.09 ( 1.38)
normal 2-mthreads-31-workers 1.00 ( 1.83) +1.93 ( 1.84)
normal 2-mthreads-16-workers 1.00 ( 2.06) +0.36 ( 2.38)
normal 2-mthreads-1-workers 1.00 ( 3.86) +0.50 ( 2.46)
normal 2-mthreads-2-workers 1.00 ( 1.76) -6.91 ( 2.55)
normal 2-mthreads-4-workers 1.00 ( 1.59) -5.58 ( 5.99)
normal 4-mthreads-8-workers 1.00 ( 0.85) +0.59 ( 0.54)
normal 4-mthreads-31-workers 1.00 ( 15.31) +15.04 ( 12.71)
normal 4-mthreads-16-workers 1.00 ( 0.99) -2.62 ( 2.15)
normal 4-mthreads-4-workers 1.00 ( 1.42) -2.72 ( 1.70)
normal 4-mthreads-1-workers 1.00 ( 1.43) -2.84 ( 1.73)
normal 4-mthreads-2-workers 1.00 ( 1.78) -4.28 ( 2.08)
normal 8-mthreads-16-workers 1.00 ( 10.04) +7.06 ( 0.73)
normal 8-mthreads-31-workers 1.00 ( 1.94) -1.66 ( 2.28)
normal 8-mthreads-2-workers 1.00 ( 2.51) -0.30 ( 1.53)
normal 8-mthreads-8-workers 1.00 ( 1.56) -1.83 ( 1.39)
normal 8-mthreads-1-workers 1.00 ( 4.08) +0.45 ( 1.45)
normal 8-mthreads-4-workers 1.00 ( 1.84) +2.85 ( 1.07)
case load baseline(std%) opt2%( std%)
normal 1-mthreads-8-workers 1.00 ( 1.44) -1.48 ( 3.79)
normal 1-mthreads-2-workers 1.00 ( 2.79) +3.32 ( 0.90)
normal 1-mthreads-1-workers 1.00 ( 1.27) +1.98 ( 1.02)
normal 1-mthreads-31-workers 1.00 ( 1.30) +5.84 ( 3.01)
normal 1-mthreads-16-workers 1.00 ( 1.74) +5.90 ( 0.68)
normal 1-mthreads-4-workers 1.00 ( 3.35) +1.82 ( 1.65)
normal 2-mthreads-8-workers 1.00 ( 2.17) +2.80 ( 2.04)
normal 2-mthreads-31-workers 1.00 ( 1.83) -0.07 ( 1.09)
normal 2-mthreads-16-workers 1.00 ( 2.06) +2.45 ( 2.55)
normal 2-mthreads-1-workers 1.00 ( 3.86) +2.41 ( 2.92)
normal 2-mthreads-2-workers 1.00 ( 1.76) -1.29 ( 2.03)
normal 2-mthreads-4-workers 1.00 ( 1.59) +0.44 ( 1.15)
normal 4-mthreads-8-workers 1.00 ( 0.85) -0.81 ( 3.03)
normal 4-mthreads-31-workers 1.00 ( 15.31) +2.06 ( 15.97)
normal 4-mthreads-16-workers 1.00 ( 0.99) -1.46 ( 2.29)
normal 4-mthreads-4-workers 1.00 ( 1.42) -0.15 ( 3.37)
normal 4-mthreads-1-workers 1.00 ( 1.43) +0.97 ( 1.95)
normal 4-mthreads-2-workers 1.00 ( 1.78) -0.38 ( 2.53)
normal 8-mthreads-16-workers 1.00 ( 10.04) +5.80 ( 1.72)
normal 8-mthreads-31-workers 1.00 ( 1.94) -0.76 ( 2.33)
normal 8-mthreads-2-workers 1.00 ( 2.51) +2.47 ( 2.17)
normal 8-mthreads-8-workers 1.00 ( 1.56) -0.66 ( 1.47)
normal 8-mthreads-1-workers 1.00 ( 4.08) +2.71 ( 2.78)
normal 8-mthreads-4-workers 1.00 ( 1.84) +2.35 ( 4.88)
case load baseline(std%) opt3%( std%)
normal 1-mthreads-8-workers 1.00 ( 1.44) -6.90 ( 3.85) **
normal 1-mthreads-2-workers 1.00 ( 2.79) +3.23 ( 3.09)
normal 1-mthreads-1-workers 1.00 ( 1.27) -1.04 ( 2.22)
normal 1-mthreads-31-workers 1.00 ( 1.30) +2.16 ( 1.64)
normal 1-mthreads-16-workers 1.00 ( 1.74) -0.72 ( 5.70)
normal 1-mthreads-4-workers 1.00 ( 3.35) -1.92 ( 4.31)
normal 2-mthreads-8-workers 1.00 ( 2.17) +0.82 ( 1.90)
normal 2-mthreads-31-workers 1.00 ( 1.83) +2.08 ( 1.16)
normal 2-mthreads-16-workers 1.00 ( 2.06) +4.04 ( 2.42)
normal 2-mthreads-1-workers 1.00 ( 3.86) +2.57 ( 3.44)
normal 2-mthreads-2-workers 1.00 ( 1.76) -0.12 ( 1.29)
normal 2-mthreads-4-workers 1.00 ( 1.59) -2.04 ( 2.83)
normal 4-mthreads-8-workers 1.00 ( 0.85) +0.22 ( 1.65)
normal 4-mthreads-31-workers 1.00 ( 15.31) +15.09 ( 9.83)
normal 4-mthreads-16-workers 1.00 ( 0.99) +1.46 ( 1.88)
normal 4-mthreads-4-workers 1.00 ( 1.42) +2.34 ( 1.57)
normal 4-mthreads-1-workers 1.00 ( 1.43) -0.77 ( 2.45)
normal 4-mthreads-2-workers 1.00 ( 1.78) -1.16 ( 1.85)
normal 8-mthreads-16-workers 1.00 ( 10.04) +7.39 ( 1.65)
normal 8-mthreads-31-workers 1.00 ( 1.94) -0.81 ( 2.14)
normal 8-mthreads-2-workers 1.00 ( 2.51) -1.93 ( 2.00)
normal 8-mthreads-8-workers 1.00 ( 1.56) +1.17 ( 1.40)
normal 8-mthreads-1-workers 1.00 ( 4.08) +1.63 ( 0.51)
normal 8-mthreads-4-workers 1.00 ( 1.84) +4.77 ( 2.36)
case load baseline(std%) opt4%( std%)
normal 1-mthreads-8-workers 1.00 ( 1.44) -0.27 ( 3.05)
normal 1-mthreads-2-workers 1.00 ( 2.79) -0.31 ( 1.19)
normal 1-mthreads-1-workers 1.00 ( 1.27) +1.62 ( 1.77)
normal 1-mthreads-31-workers 1.00 ( 1.30) +1.30 ( 3.34)
normal 1-mthreads-16-workers 1.00 ( 1.74) +0.07 ( 3.38)
normal 1-mthreads-4-workers 1.00 ( 3.35) +1.08 ( 2.48)
normal 2-mthreads-8-workers 1.00 ( 2.17) +0.04 ( 3.87)
normal 2-mthreads-31-workers 1.00 ( 1.83) +1.29 ( 1.44)
normal 2-mthreads-16-workers 1.00 ( 2.06) +0.94 ( 2.96)
normal 2-mthreads-1-workers 1.00 ( 3.86) +2.85 ( 2.12)
normal 2-mthreads-2-workers 1.00 ( 1.76) -0.30 ( 2.37)
normal 2-mthreads-4-workers 1.00 ( 1.59) +2.22 ( 1.51)
normal 4-mthreads-8-workers 1.00 ( 0.85) +2.20 ( 3.06)
normal 4-mthreads-31-workers 1.00 ( 15.31) +15.65 ( 12.68)
normal 4-mthreads-16-workers 1.00 ( 0.99) -1.96 ( 3.30)
normal 4-mthreads-4-workers 1.00 ( 1.42) -1.19 ( 3.42)
normal 4-mthreads-1-workers 1.00 ( 1.43) +2.26 ( 2.45)
normal 4-mthreads-2-workers 1.00 ( 1.78) -1.36 ( 2.75)
normal 8-mthreads-16-workers 1.00 ( 10.04) -0.33 ( 11.13)
normal 8-mthreads-31-workers 1.00 ( 1.94) -1.14 ( 2.01)
normal 8-mthreads-2-workers 1.00 ( 2.51) +2.32 ( 2.26)
normal 8-mthreads-8-workers 1.00 ( 1.56) -0.44 ( 1.54)
normal 8-mthreads-1-workers 1.00 ( 4.08) +2.17 ( 2.10)
normal 8-mthreads-4-workers 1.00 ( 1.84) +3.42 ( 2.34)
Again, per the perf record data, the cycle distributions are almost the
same between the baseline and patches 1-4. The symbols related to
patches 1-4 are set_rd_overloaded/set_rd_overutilized, inlined into
`update_sd_lb_stats`, which accounts for ~0.47% (self) of cycles in the
baseline 1-message-thread, 8-workers configuration; no regression was
observed in this function with patches 1-4. So I think the patches won't
cause regressions in schbench (old).
Schbench(new, 48aed1d)
======================
The workload was executed using the test framework available at
https://github.com/gormanm/mmtests. Each configuration was run for
5 iterations, with a runtime of 100 seconds per iteration. No
significant regressions were observed, as detailed below:
Notes:
1. The message thread count is always 6, matching the number of NUMA nodes.
2. 1/2/4/8/16/32/64/79 is the number of workers per message thread.
baseline patch1
Amean request-99.0th-qrtle-1 1.00 0.00%
Amean rps-50.0th-qrtle-1 1.00 0.06%
Amean wakeup-99.0th-qrtle-1 1.00 0.26%
Amean request-99.0th-qrtle-2 1.00 0.23%
Amean rps-50.0th-qrtle-2 1.00 0.00%
Amean wakeup-99.0th-qrtle-2 1.00 1.09%
Amean request-99.0th-qrtle-4 1.00 -1.32%
Amean rps-50.0th-qrtle-4 1.00 0.11%
Amean wakeup-99.0th-qrtle-4 1.00 -0.41%
Amean request-99.0th-qrtle-8 1.00 -0.08%
Amean rps-50.0th-qrtle-8 1.00 -0.17%
Amean wakeup-99.0th-qrtle-8 1.00 0.37%
Amean request-99.0th-qrtle-16 1.00 0.23%
Amean rps-50.0th-qrtle-16 1.00 -0.06%
Amean wakeup-99.0th-qrtle-16 1.00 1.03%
Amean request-99.0th-qrtle-32 1.00 0.27%
Amean rps-50.0th-qrtle-32 1.00 0.06%
Amean wakeup-99.0th-qrtle-32 1.00 -0.37%
Amean request-99.0th-qrtle-64 1.00 0.57%
Amean rps-50.0th-qrtle-64 1.00 -0.28%
Amean wakeup-99.0th-qrtle-64 1.00 -3.00%
Amean request-99.0th-qrtle-79 1.00 0.21%
Amean rps-50.0th-qrtle-79 1.00 -0.23%
Amean wakeup-99.0th-qrtle-79 1.00 2.00%
baseline patch2
Amean request-99.0th-qrtle-1 1.00 -0.46%
Amean rps-50.0th-qrtle-1 1.00 0.11%
Amean wakeup-99.0th-qrtle-1 1.00 -2.01%
Amean request-99.0th-qrtle-2 1.00 -0.08%
Amean rps-50.0th-qrtle-2 1.00 0.00%
Amean wakeup-99.0th-qrtle-2 1.00 -1.42%
Amean request-99.0th-qrtle-4 1.00 -1.16%
Amean rps-50.0th-qrtle-4 1.00 0.11%
Amean wakeup-99.0th-qrtle-4 1.00 -1.30%
Amean request-99.0th-qrtle-8 1.00 -0.08%
Amean rps-50.0th-qrtle-8 1.00 -0.40%
Amean wakeup-99.0th-qrtle-8 1.00 1.25%
Amean request-99.0th-qrtle-16 1.00 0.46%
Amean rps-50.0th-qrtle-16 1.00 -0.06%
Amean wakeup-99.0th-qrtle-16 1.00 2.52%
Amean request-99.0th-qrtle-32 1.00 14.83%
Amean rps-50.0th-qrtle-32 1.00 0.75%
Amean wakeup-99.0th-qrtle-32 1.00 3.03%
Amean request-99.0th-qrtle-64 1.00 -0.44%
Amean rps-50.0th-qrtle-64 1.00 0.28%
Amean wakeup-99.0th-qrtle-64 1.00 -3.50%
Amean request-99.0th-qrtle-79 1.00 -0.09%
Amean rps-50.0th-qrtle-79 1.00 0.08%
Amean wakeup-99.0th-qrtle-79 1.00 -1.20%
baseline patch3
Amean request-99.0th-qrtle-1 1.00 0.31%
Amean rps-50.0th-qrtle-1 1.00 -0.17%
Amean wakeup-99.0th-qrtle-1 1.00 0.44%
Amean request-99.0th-qrtle-2 1.00 -0.61%
Amean rps-50.0th-qrtle-2 1.00 -0.29%
Amean wakeup-99.0th-qrtle-2 1.00 1.93%
Amean request-99.0th-qrtle-4 1.00 -1.62%
Amean rps-50.0th-qrtle-4 1.00 -0.17%
Amean wakeup-99.0th-qrtle-4 1.00 0.00%
Amean request-99.0th-qrtle-8 1.00 0.00%
Amean rps-50.0th-qrtle-8 1.00 -0.40%
Amean wakeup-99.0th-qrtle-8 1.00 -0.29%
Amean request-99.0th-qrtle-16 1.00 0.53%
Amean rps-50.0th-qrtle-16 1.00 -0.17%
Amean wakeup-99.0th-qrtle-16 1.00 -1.03%
Amean request-99.0th-qrtle-32 1.00 0.09%
Amean rps-50.0th-qrtle-32 1.00 -0.17%
Amean wakeup-99.0th-qrtle-32 1.00 2.41%
Amean request-99.0th-qrtle-64 1.00 0.26%
Amean rps-50.0th-qrtle-64 1.00 -0.16%
Amean wakeup-99.0th-qrtle-64 1.00 -2.00%
Amean request-99.0th-qrtle-79 1.00 0.26%
Amean rps-50.0th-qrtle-79 1.00 -0.46%
Amean wakeup-99.0th-qrtle-79 1.00 1.20%
baseline patch4
Amean request-99.0th-qrtle-1 1.00 -0.15%
Amean rps-50.0th-qrtle-1 1.00 -0.06%
Amean wakeup-99.0th-qrtle-1 1.00 -2.88%
Amean request-99.0th-qrtle-2 1.00 -0.31%
Amean rps-50.0th-qrtle-2 1.00 -0.29%
Amean wakeup-99.0th-qrtle-2 1.00 -0.59%
Amean request-99.0th-qrtle-4 1.00 -0.23%
Amean rps-50.0th-qrtle-4 1.00 -0.11%
Amean wakeup-99.0th-qrtle-4 1.00 -0.41%
Amean request-99.0th-qrtle-8 1.00 -0.08%
Amean rps-50.0th-qrtle-8 1.00 -0.52%
Amean wakeup-99.0th-qrtle-8 1.00 1.91%
Amean request-99.0th-qrtle-16 1.00 0.76%
Amean rps-50.0th-qrtle-16 1.00 0.06%
Amean wakeup-99.0th-qrtle-16 1.00 1.03%
Amean request-99.0th-qrtle-32 1.00 8.36%
Amean rps-50.0th-qrtle-32 1.00 0.00%
Amean wakeup-99.0th-qrtle-32 1.00 -1.05%
Amean request-99.0th-qrtle-64 1.00 0.13%
Amean rps-50.0th-qrtle-64 1.00 0.00%
Amean wakeup-99.0th-qrtle-64 1.00 -4.00%
Amean request-99.0th-qrtle-79 1.00 -0.39%
Amean rps-50.0th-qrtle-79 1.00 0.14%
Amean wakeup-99.0th-qrtle-79 1.00 -0.40%
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
@ 2025-09-01 5:10 ` Chen, Yu C
2025-09-01 13:24 ` Deng, Pan
0 siblings, 1 reply; 16+ messages in thread
From: Chen, Yu C @ 2025-09-01 5:10 UTC (permalink / raw)
To: Pan Deng; +Cc: linux-kernel, tianyou.li, tim.c.chen, peterz, mingo, Chen Yu
On 7/7/2025 10:35 AM, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on an HCC system, significant
> cache line contention is observed around `cpupri_vec->count` and `mask` in
> struct root_domain.
>
[it seems that my last reply did not make it to the lkml][snip]
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..245b0fa626be 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -9,7 +9,7 @@
>
> struct cpupri_vec {
> atomic_t count;
> - cpumask_var_t mask;
> + cpumask_var_t mask ____cacheline_aligned;
Just curious, since this is to avoid cache contention among CPUs,
is it better to use ____cacheline_aligned_in_smp, so the single
CPU system is not impacted.
thanks,
Chenyu

> };
>
> struct cpupri {
* RE: [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
2025-09-01 5:10 ` Chen, Yu C
@ 2025-09-01 13:24 ` Deng, Pan
0 siblings, 0 replies; 16+ messages in thread
From: Deng, Pan @ 2025-09-01 13:24 UTC (permalink / raw)
To: Chen, Yu C
Cc: linux-kernel@vger.kernel.org, Li, Tianyou,
tim.c.chen@linux.intel.com, peterz@infradead.org,
mingo@kernel.org, Chen Yu
Thanks Yu, will update the patch.
Best Regards
Pan
> -----Original Message-----
> From: Chen, Yu C <yu.c.chen@intel.com>
> Sent: Monday, September 1, 2025 1:10 PM
> To: Deng, Pan <pan.deng@intel.com>
> Cc: linux-kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com; peterz@infradead.org; mingo@kernel.org; Chen Yu
> <yu.chen.surf@gmail.com>
> Subject: Re: [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache
> line contention
>
> On 7/7/2025 10:35 AM, Pan Deng wrote:
> > When running a multi-instance FFmpeg workload on an HCC system,
> significant
> > cache line contention is observed around `cpupri_vec->count` and `mask` in
> > struct root_domain.
> >
>
> [it seems that my last reply did not make it to the lkml][snip]
>
> > diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> > index d6cba0020064..245b0fa626be 100644
> > --- a/kernel/sched/cpupri.h
> > +++ b/kernel/sched/cpupri.h
> > @@ -9,7 +9,7 @@
> >
> > struct cpupri_vec {
> > atomic_t count;
> > - cpumask_var_t mask;
> > + cpumask_var_t mask ____cacheline_aligned;
>
> Just curious, since this is to avoid cache contention among CPUs,
> is it better to use ____cacheline_aligned_in_smp, so the single
> CPU system is not impacted.
>
> thanks,
> Chenyu
> > };
> >
> > struct cpupri {
end of thread, other threads:[~2025-09-01 13:24 UTC | newest]
Thread overview: 16+ messages
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
2025-09-01 5:10 ` Chen, Yu C
2025-09-01 13:24 ` Deng, Pan
2025-07-07 2:35 ` [PATCH 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2025-07-07 6:53 ` kernel test robot
2025-07-07 11:36 ` Deng, Pan
2025-07-07 6:53 ` kernel test robot
2025-07-08 5:33 ` kernel test robot
2025-07-08 14:02 ` Deng, Pan
2025-07-09 8:56 ` Li, Philip
2025-07-07 2:35 ` [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
2025-07-21 11:23 ` Chen, Yu C
2025-07-22 14:46 ` Deng, Pan
2025-08-06 14:00 ` Deng, Pan