* [PATCH 0/4] sched/rt: mitigate root_domain cache line contention
@ 2025-07-07 2:35 Pan Deng
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
From: Deng Pan <pan.deng@intel.com>
When running a multi-instance FFmpeg workload in a cloud environment,
cache line contention during access to root_domain data structures is
severe and significantly degrades performance.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
Profiling shows the kernel consumes ~20% of CPU cycles, which is
excessive in this scenario. The overhead primarily comes from RT task
scheduling functions like `cpupri_set`, `cpupri_find_fitness`,
`dequeue_pushable_task`, `enqueue_pushable_task`, `pull_rt_task`,
`__find_first_and_bit`, and `__bitmap_and`. This is due to read/write
contention on root_domain cache lines.
The `perf c2c` report, sorted by contention severity, reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` is heavily loaded/stored,
since counts[0] is updated more frequently than the others: it changes
whenever an RT task enqueues onto an empty runqueue or dequeues from a
non-overloaded runqueue.
- `rto_mask` is heavily loaded
- `rto_loop_next` and `rto_loop_start` are frequently stored
- `rto_push_work` and `rto_lock` are lightly accessed
- cycles per load: ~10K to 59K.
root_domain cache line 1:
- `rto_count` is frequently loaded/stored
- `overloaded` is heavily loaded
- cycles per load: ~2.8K to 44K
cpumask (bitmap) cache line of cpupri_vec->mask:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K
The last cache line of cpupri:
- `cpupri_vec->count` and `mask` contend with each other. The transcoding
threads use RT priority 99, so the contention falls on the end of the
structure.
- cycles per load: ~1.5K to 10.5K
Based on the above, we propose 4 patches to mitigate the contention.
Patch 1: Reorganize `cpupri_vec`, separate `count`, `mask` fields,
reducing contention on root_domain cache line 3 and cpupri's
last cache line.
Patch 2: Restructure `root_domain` to minimize contention on root_domain
cache lines 1 and 3 by reordering fields.
Patch 3: Split `root_domain->rto_count` to per-NUMA-node counters,
reducing the contention on root_domain cache line 1.
Patch 4: Split `cpupri_vec->cpumask` to per-NUMA-node bitmaps, reducing
load/store contention on the cpumask bitmap cache line.
Evaluation:
Performance improvements (FPS, relative to baseline):
- Patch 1: +11.0%
- Patch 2: +5.0%
- Patch 3: +4.0%
- Patch 4: +3.8%
Kernel CPU cycle usage reduction:
- Patch 1: 20.0% -> 11.0%
- Patch 2: 20.0% -> 17.7%
- Patch 3: 20.0% -> 18.6%
- Patch 4: 20.0% -> 18.7%
Cycles per load reduction (by perf c2c report):
- Patch 1:
- `root_domain` cache line 3: 10K–59K -> 0.5K–8K
- `cpupri` last cache line: 1.5K–10.5K -> eliminated
- Patch 2:
- `root_domain` cache line 1: 2.8K–44K -> 2.1K–2.7K
- `root_domain` cache line 3: 10K–59K -> eliminated
- Patch 3:
- `root_domain` cache line 1: 2.8K–44K -> eliminated
- Patch 4:
- `cpupri_vec->mask` cache line: 2.2K–8.7K -> 0.5K–2.2K
Comments are appreciated.
Pan Deng (4):
sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
sched/rt: Restructure root_domain to reduce cacheline contention
sched/rt: Split root_domain->rto_count to per-NUMA-node counters
sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce
contention
kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++----
kernel/sched/cpupri.h | 6 +-
kernel/sched/rt.c | 65 ++++++++++++-
kernel/sched/sched.h | 61 ++++++------
kernel/sched/topology.c | 7 ++
5 files changed, 291 insertions(+), 48 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
@ 2025-07-07 2:35 ` Pan Deng
2025-09-01 5:10 ` Chen, Yu C
2025-07-07 2:35 ` [PATCH 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
` (2 subsequent siblings)
3 siblings, 1 reply; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC (high core count)
system, significant
cache line contention is observed around `cpupri_vec->count` and `mask` in
struct root_domain.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
perf c2c tool reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
and contends with other fields, since counts[0] is updated more
frequently than the others: it changes whenever an RT task enqueues
onto an empty runqueue or dequeues from a non-overloaded runqueue.
- cycles per load: ~10K to 59K
cpupri's last cache line:
- `cpupri_vec->count` and `mask` contend with each other. The transcoding
threads use RT priority 99, so the contention falls on the last cache
line of the structure.
- cycles per load: ~1.5K to 10.5K
This change mitigates `cpupri_vec->count`, `mask` related contentions by
separating each count and mask into different cache lines.
As a result:
- FPS improves by ~11%
- Kernel cycles% drops from ~20% to ~11%
- `count` and `mask` related cache line contention is mitigated, perf c2c
shows root_domain cache line 3 `cycles per load` drops from ~10K-59K
to ~0.5K-8K, cpupri's last cache line no longer appears in the report.
Note: The side effect of this change is that struct cpupri size is
increased from 26 cache lines to 203 cache lines.
An alternative approach could be separating `counts` and `masks` into 2
vectors in cpupri_vec (counts[] and masks[]), and add two paddings:
1. Between counts[0] and counts[1], since counts[0] is more frequently
updated than others.
2. Between the two vectors, since counts[] is read-write access while
masks[] is read access when it stores pointers.
The alternative approach introduces the complexity of a 31+/21- LoC
change. It achieves almost the same performance, while struct cpupri
size is reduced from 26 cache lines to 21 cache lines.
Appendix:
1. Current layout of contended data structures:
struct root_domain {
atomic_t refcount; /* 0 4 */
atomic_t rto_count; /* 4 4 */
struct callback_head rcu __attribute__((__aligned__(8)));/*8 16 */
cpumask_var_t span; /* 24 8 */
cpumask_var_t online; /* 32 8 */
bool overloaded; /* 40 1 */
bool overutilized; /* 41 1 */
/* XXX 6 bytes hole, try to pack */
cpumask_var_t dlo_mask; /* 48 8 */
atomic_t dlo_count; /* 56 4 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct dl_bw dl_bw; /* 64 24 */
struct cpudl cpudl; /* 88 24 */
u64 visit_gen; /* 112 8 */
struct irq_work rto_push_work; /* 120 32 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
raw_spinlock_t rto_lock; /* 152 4 */
int rto_loop; /* 156 4 */
int rto_cpu; /* 160 4 */
atomic_t rto_loop_next; /* 164 4 */
atomic_t rto_loop_start; /* 168 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t rto_mask; /* 176 8 */
struct cpupri cpupri; /* 184 1624 */
/* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
struct perf_domain * pd; /* 1808 8 */
/* size: 1816, cachelines: 29, members: 21 */
/* sum members: 1802, holes: 3, sum holes: 14 */
/* forced alignments: 1 */
/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));
struct cpupri {
struct cpupri_vec pri_to_cpu[101]; /* 0 1616 */
/* --- cacheline 25 boundary (1600 bytes) was 16 bytes ago --- */
int * cpu_to_pri; /* 1616 8 */
/* size: 1624, cachelines: 26, members: 2 */
/* last cacheline: 24 bytes */
};
struct cpupri_vec {
atomic_t count; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t mask; /* 8 8 */
/* size: 16, cachelines: 1, members: 2 */
/* sum members: 12, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};
2. Perf c2c report of root_domain cache line 3:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
353 44 62 0xff14d42c400e3880
------- ------- ------ ------ ------ ------ ------------------------
0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_
0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_
0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on
0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single
0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on
0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl
0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl
0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl
0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock
1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task
0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task
0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task
0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task
0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task
18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task
17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task
1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task
0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task
34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness
13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set
3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set
1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness
1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set
1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set
3. Perf c2c report of cpupri's last cache line
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
149 43 41 0xff14d42c400e3ec0
------- ------- ------ ------ ------ ------ ------------------------
8.72% 11.63% 0.00% 0x8 2001 165 cpupri_find_fitness
1.34% 2.33% 0.00% 0x18 1456 151 cpupri_find_fitness
8.72% 9.30% 58.54% 0x28 1744 263 cpupri_set
2.01% 4.65% 41.46% 0x28 1958 301 cpupri_set
1.34% 0.00% 0.00% 0x28 10580 6 cpupri_set
69.80% 67.44% 0.00% 0x30 1754 347 cpupri_set
8.05% 4.65% 0.00% 0x30 2144 256 cpupri_set
Signed-off-by: Pan Deng <pan.deng@intel.com>
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/cpupri.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
struct cpupri_vec {
atomic_t count;
- cpumask_var_t mask;
+ cpumask_var_t mask ____cacheline_aligned;
};
struct cpupri {
--
2.43.5
* [PATCH 2/4] sched/rt: Restructure root_domain to reduce cacheline contention
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
@ 2025-07-07 2:35 ` Pan Deng
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2025-07-07 2:35 ` [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
3 siblings, 0 replies; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed in root_domain cache lines 1 and 3.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
perf c2c tool reveals (sorted by contention severity):
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored,
since counts[0] is updated more frequently than the others: it changes
whenever an RT task enqueues onto an empty runqueue or dequeues from a
non-overloaded runqueue.
- `rto_mask` (0x30) is heavily loaded
- `rto_loop_next` (0x24) and `rto_loop_start` (0x28) are frequently stored
- `rto_push_work` (0x0) and `rto_lock` (0x18) are lightly accessed
- cycles per load: ~10K to 59K
root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K
This change adjusts the layout of `root_domain` to isolate these contended
fields across separate cache lines:
1. `rto_count` remains in the 1st cache line; `overloaded` and
`overutilized` are moved to the last cache line
2. `rto_push_work` is placed in the 2nd cache line
3. `rto_loop_start`, `rto_loop_next`, and `rto_lock` remain in the 3rd
cache line; `rto_mask` is moved near `pd` in the penultimate cache line
4. `cpupri` starts at the 4th cache line to prevent `pri_to_cpu[0].count`
contending with fields in cache line 3.
With this change:
- FPS improves by ~5%
- Kernel cycles% drops from ~20% to ~17.7%
- root_domain cache line 3 no longer appears in perf-c2c report
- cycles per load of root_domain cache line 1 is reduced from
~2.8K-44K to ~2.1K-2.7K
Given the nature of the change, to my understanding, it doesn't
introduce any negative impact in other scenarios.
Note: This change increases the size of `root_domain` from 29 to 31
cache lines. This is considered acceptable since `root_domain` is a
single global object.
Appendix:
1. Current layout of contended data structures:
struct root_domain {
atomic_t refcount; /* 0 4 */
atomic_t rto_count; /* 4 4 */
struct callback_head rcu __attribute__((__aligned__(8)));/*8 16 */
cpumask_var_t span; /* 24 8 */
cpumask_var_t online; /* 32 8 */
bool overloaded; /* 40 1 */
bool overutilized; /* 41 1 */
/* XXX 6 bytes hole, try to pack */
cpumask_var_t dlo_mask; /* 48 8 */
atomic_t dlo_count; /* 56 4 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct dl_bw dl_bw; /* 64 24 */
struct cpudl cpudl; /* 88 24 */
u64 visit_gen; /* 112 8 */
struct irq_work rto_push_work; /* 120 32 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
raw_spinlock_t rto_lock; /* 152 4 */
int rto_loop; /* 156 4 */
int rto_cpu; /* 160 4 */
atomic_t rto_loop_next; /* 164 4 */
atomic_t rto_loop_start; /* 168 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t rto_mask; /* 176 8 */
struct cpupri cpupri; /* 184 1624 */
/* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
struct perf_domain * pd; /* 1808 8 */
/* size: 1816, cachelines: 29, members: 21 */
/* sum members: 1802, holes: 3, sum holes: 14 */
/* forced alignments: 1 */
/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));
struct cpupri {
struct cpupri_vec pri_to_cpu[101]; /* 0 1616 */
/* --- cacheline 25 boundary (1600 bytes) was 16 bytes ago --- */
int * cpu_to_pri; /* 1616 8 */
/* size: 1624, cachelines: 26, members: 2 */
/* last cacheline: 24 bytes */
};
struct cpupri_vec {
atomic_t count; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t mask; /* 8 8 */
/* size: 16, cachelines: 1, members: 2 */
/* sum members: 12, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};
2. Perf c2c report of root_domain cache line 3:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
353 44 62 0xff14d42c400e3880
------- ------- ------ ------ ------ ------ ------------------------
0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_
0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_
0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on
0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single
0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on
0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl
0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl
0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl
0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock
1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task
0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task
0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task
0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task
0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task
18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task
17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task
1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task
0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task
34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness
13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set
3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set
1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness
1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set
1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set
3. Perf c2c report of root_domain cache line 1:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
231 43 48 0xff14d42c400e3800
------- ------- ------ ------ ------ ------ ------------------------
22.51% 18.60% 0.00% 0x4 5041 247 pull_rt_task
5.63% 2.33% 45.83% 0x4 6995 315 dequeue_pushable_task
3.90% 4.65% 54.17% 0x4 6587 370 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 17111 4 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 44062 4 dequeue_pushable_task
32.03% 27.91% 0.00% 0x28 6393 285 enqueue_task_rt
16.45% 27.91% 0.00% 0x28 5534 139 sched_balance_newidle
14.72% 18.60% 0.00% 0x28 5287 110 dequeue_task_rt
3.46% 0.00% 0.00% 0x28 2820 25 enqueue_task_fair
0.43% 0.00% 0.00% 0x28 220 3 enqueue_task_stop
Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/sched.h | 52 +++++++++++++++++++++++---------------------
1 file changed, 27 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..dd3c79470bfc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -968,24 +968,29 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;
+ atomic_t dlo_count;
+ struct dl_bw dl_bw;
+ struct cpudl cpudl;
+
+#ifdef HAVE_RT_PUSH_IPI
/*
- * Indicate pullable load on at least one CPU, e.g:
- * - More than one runnable task
- * - Running task is misfit
+ * For IPI pull requests, loop across the rto_mask.
*/
- bool overloaded;
-
- /* Indicate one or more CPUs over-utilized (tipping point) */
- bool overutilized;
+ struct irq_work rto_push_work;
+ raw_spinlock_t rto_lock;
+ /* These are only updated and read within rto_lock */
+ int rto_loop;
+ int rto_cpu;
+ /* These atomics are updated outside of a lock */
+ atomic_t rto_loop_next;
+ atomic_t rto_loop_start;
+#endif
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
*/
cpumask_var_t dlo_mask;
- atomic_t dlo_count;
- struct dl_bw dl_bw;
- struct cpudl cpudl;
/*
* Indicate whether a root_domain's dl_bw has been checked or
@@ -995,32 +1000,29 @@ struct root_domain {
* that u64 is 'big enough'. So that shouldn't be a concern.
*/
u64 visit_cookie;
+ struct cpupri cpupri ____cacheline_aligned;
-#ifdef HAVE_RT_PUSH_IPI
/*
- * For IPI pull requests, loop across the rto_mask.
+ * NULL-terminated list of performance domains intersecting with the
+ * CPUs of the rd. Protected by RCU.
*/
- struct irq_work rto_push_work;
- raw_spinlock_t rto_lock;
- /* These are only updated and read within rto_lock */
- int rto_loop;
- int rto_cpu;
- /* These atomics are updated outside of a lock */
- atomic_t rto_loop_next;
- atomic_t rto_loop_start;
-#endif
+ struct perf_domain __rcu *pd ____cacheline_aligned;
+
/*
* The "RT overload" flag: it gets set if a CPU has more than
* one runnable RT task.
*/
cpumask_var_t rto_mask;
- struct cpupri cpupri;
/*
- * NULL-terminated list of performance domains intersecting with the
- * CPUs of the rd. Protected by RCU.
+ * Indicate pullable load on at least one CPU, e.g:
+ * - More than one runnable task
+ * - Running task is misfit
*/
- struct perf_domain __rcu *pd;
+ bool overloaded ____cacheline_aligned;
+
+ /* Indicate one or more CPUs over-utilized (tipping point) */
+ bool overutilized;
};
extern void init_defrootdomain(void);
--
2.43.5
* [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
2025-07-07 2:35 ` [PATCH 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
@ 2025-07-07 2:35 ` Pan Deng
2025-07-07 6:53 ` kernel test robot
` (2 more replies)
2025-07-07 2:35 ` [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
3 siblings, 3 replies; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on the root_domain `rto_count` and
`overloaded` fields.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
perf c2c tool reveals:
root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K
A separate patch rearranges root_domain to place `overloaded` on a
different cache line, but this alone is insufficient to resolve the
contention on `rto_count`. As a complement, this patch splits
`rto_count` into per-NUMA-node counters to reduce the contention.
With this change:
- FPS improves by ~4%
- Kernel cycles% drops from ~20% to ~18.6%
- The cache line no longer appears in perf-c2c report
Appendix:
1. Perf c2c report of root_domain cache line 1:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
231 43 48 0xff14d42c400e3800
------- ------- ------ ------ ------ ------ ------------------------
22.51% 18.60% 0.00% 0x4 5041 247 pull_rt_task
5.63% 2.33% 45.83% 0x4 6995 315 dequeue_pushable_task
3.90% 4.65% 54.17% 0x4 6587 370 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 17111 4 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 44062 4 dequeue_pushable_task
32.03% 27.91% 0.00% 0x28 6393 285 enqueue_task_rt
16.45% 27.91% 0.00% 0x28 5534 139 sched_balance_newidle
14.72% 18.60% 0.00% 0x28 5287 110 dequeue_task_rt
3.46% 0.00% 0.00% 0x28 2820 25 enqueue_task_fair
0.43% 0.00% 0.00% 0x28 220 3 enqueue_task_stop
Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/rt.c | 65 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 9 +++++-
kernel/sched/topology.c | 7 +++++
3 files changed, 77 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c37033..cc820dbde6d6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -337,9 +337,58 @@ static inline bool need_pull_rt_task(struct rq *rq, struct task_struct *prev)
return rq->online && rq->rt.highest_prio.curr > prev->prio;
}
+int rto_counts_init(atomic_tp **rto_counts)
+{
+ int i;
+ atomic_tp *counts = kzalloc(nr_node_ids * sizeof(atomic_tp), GFP_KERNEL);
+
+ if (!counts)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_node_ids; i++) {
+ counts[i] = kzalloc_node(sizeof(atomic_t), GFP_KERNEL, i);
+
+ if (!counts[i])
+ goto cleanup;
+ }
+
+ *rto_counts = counts;
+ return 0;
+
+cleanup:
+ while (i--)
+ kfree(counts[i]);
+
+ kfree(counts);
+ return -ENOMEM;
+}
+
+void rto_counts_cleanup(atomic_tp *rto_counts)
+{
+ for (int i = 0; i < nr_node_ids; i++)
+ kfree(rto_counts[i]);
+
+ kfree(rto_counts);
+}
+
static inline int rt_overloaded(struct rq *rq)
{
- return atomic_read(&rq->rd->rto_count);
+ int count = 0;
+ int cur_node, nid;
+
+ cur_node = numa_node_id();
+
+ for (int i = 0; i < nr_node_ids; i++) {
+ nid = (cur_node + i) % nr_node_ids;
+ count += atomic_read(rq->rd->rto_counts[nid]);
+
+ // The caller only checks whether the count is
+ // 0 or 1, so return as soon as it exceeds 1
+ if (count > 1)
+ return count;
+ }
+
+ return count;
}
static inline void rt_set_overload(struct rq *rq)
@@ -358,7 +407,7 @@ static inline void rt_set_overload(struct rq *rq)
* Matched by the barrier in pull_rt_task().
*/
smp_wmb();
- atomic_inc(&rq->rd->rto_count);
+ atomic_inc(rq->rd->rto_counts[cpu_to_node(rq->cpu)]);
}
static inline void rt_clear_overload(struct rq *rq)
@@ -367,7 +416,7 @@ static inline void rt_clear_overload(struct rq *rq)
return;
/* the order here really doesn't matter */
- atomic_dec(&rq->rd->rto_count);
+ atomic_dec(rq->rd->rto_counts[cpu_to_node(rq->cpu)]);
cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
}
@@ -443,6 +492,16 @@ static inline void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
static inline void rt_queue_push_tasks(struct rq *rq)
{
}
+
+int rto_counts_init(atomic_tp **rto_counts)
+{
+ return 0;
+}
+
+void rto_counts_cleanup(atomic_tp *rto_counts)
+{
+}
+
#endif /* CONFIG_SMP */
static void enqueue_top_rt_rq(struct rt_rq *rt_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dd3c79470bfc..f80968724dd6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -953,6 +953,8 @@ struct perf_domain {
struct rcu_head rcu;
};
+typedef atomic_t *atomic_tp;
+
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
@@ -963,12 +965,15 @@ struct perf_domain {
*/
struct root_domain {
atomic_t refcount;
- atomic_t rto_count;
struct rcu_head rcu;
cpumask_var_t span;
cpumask_var_t online;
atomic_t dlo_count;
+
+ /* rto_count per node */
+ atomic_tp *rto_counts;
+
struct dl_bw dl_bw;
struct cpudl cpudl;
@@ -1030,6 +1035,8 @@ extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
extern void sched_get_rd(struct root_domain *rd);
extern void sched_put_rd(struct root_domain *rd);
+extern int rto_counts_init(atomic_tp **rto_counts);
+extern void rto_counts_cleanup(atomic_tp *rto_counts);
static inline int get_rd_overloaded(struct root_domain *rd)
{
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b958fe48e020..166dc8177a44 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -457,6 +457,7 @@ static void free_rootdomain(struct rcu_head *rcu)
{
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
+ rto_counts_cleanup(rd->rto_counts);
cpupri_cleanup(&rd->cpupri);
cpudl_cleanup(&rd->cpudl);
free_cpumask_var(rd->dlo_mask);
@@ -549,8 +550,14 @@ static int init_rootdomain(struct root_domain *rd)
if (cpupri_init(&rd->cpupri) != 0)
goto free_cpudl;
+
+ if (rto_counts_init(&rd->rto_counts) != 0)
+ goto free_cpupri;
+
return 0;
+free_cpupri:
+ cpupri_cleanup(&rd->cpupri);
free_cpudl:
cpudl_cleanup(&rd->cpudl);
free_rto_mask:
--
2.43.5
* [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
` (2 preceding siblings ...)
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
@ 2025-07-07 2:35 ` Pan Deng
2025-07-21 11:23 ` Chen, Yu C
3 siblings, 1 reply; 16+ messages in thread
From: Pan Deng @ 2025-07-07 2:35 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on the bitmap of `cpupri_vec->cpumask`.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
perf c2c tool reveals:
cpumask (bitmap) cache line of `cpupri_vec->mask`:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K
This change splits `cpupri_vec->cpumask` into per-NUMA-node masks to
mitigate the cache line contention.
As a result:
- FPS improves by ~3.8%
- Kernel cycles% drops from ~20% to ~18.7%
- Cache line contention is mitigated, perf-c2c shows cycles per load
drops from ~2.2K-8.7K to ~0.5K-2.2K
Note: the CONFIG_CPUMASK_OFFSTACK=n case remains unchanged.
Appendix:
1. Perf c2c report of `cpupri_vec->mask` bitmap cache line:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
155 39 39 0xff14d52c4682d800
------- ------- ------ ------ ------ ------ ------------------------
43.23% 43.59% 0.00% 0x0 3489 415 _find_first_and_bit
3.23% 5.13% 0.00% 0x0 3478 107 __bitmap_and
3.23% 0.00% 0.00% 0x0 2712 33 _find_first_and_bit
1.94% 0.00% 7.69% 0x0 5992 33 cpupri_set
0.00% 0.00% 5.13% 0x0 3733 19 cpupri_set
12.90% 12.82% 0.00% 0x8 3452 297 _find_first_and_bit
1.29% 2.56% 0.00% 0x8 3007 117 __bitmap_and
0.00% 5.13% 0.00% 0x8 3041 20 _find_first_and_bit
0.00% 2.56% 2.56% 0x8 2374 22 cpupri_set
0.00% 0.00% 7.69% 0x8 4194 38 cpupri_set
8.39% 2.56% 0.00% 0x10 3336 264 _find_first_and_bit
3.23% 0.00% 0.00% 0x10 3023 46 _find_first_and_bit
2.58% 0.00% 0.00% 0x10 3040 130 __bitmap_and
1.29% 0.00% 12.82% 0x10 4075 34 cpupri_set
0.00% 0.00% 2.56% 0x10 2197 19 cpupri_set
0.00% 2.56% 7.69% 0x18 4085 27 cpupri_set
0.00% 2.56% 0.00% 0x18 3128 220 _find_first_and_bit
0.00% 0.00% 5.13% 0x18 3028 20 cpupri_set
2.58% 2.56% 0.00% 0x20 3089 198 _find_first_and_bit
1.29% 0.00% 5.13% 0x20 5114 29 cpupri_set
0.65% 2.56% 0.00% 0x20 3224 96 __bitmap_and
0.65% 0.00% 7.69% 0x20 4392 31 cpupri_set
2.58% 0.00% 0.00% 0x28 3327 214 _find_first_and_bit
0.65% 2.56% 5.13% 0x28 5252 31 cpupri_set
0.65% 0.00% 7.69% 0x28 8755 25 cpupri_set
0.65% 0.00% 0.00% 0x28 4414 14 _find_first_and_bit
1.29% 2.56% 0.00% 0x30 3139 171 _find_first_and_bit
0.65% 0.00% 7.69% 0x30 2185 18 cpupri_set
0.65% 0.00% 0.00% 0x30 3404 108 __bitmap_and
0.00% 0.00% 2.56% 0x30 5542 21 cpupri_set
3.23% 5.13% 0.00% 0x38 3493 190 _find_first_and_bit
3.23% 2.56% 0.00% 0x38 3171 108 __bitmap_and
0.00% 2.56% 7.69% 0x38 3285 14 cpupri_set
0.00% 0.00% 5.13% 0x38 4035 27 cpupri_set
Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++++----
kernel/sched/cpupri.h | 4 +
2 files changed, 186 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 42c40cfdf836..306b6baff4cd 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -64,6 +64,143 @@ static int convert_prio(int prio)
return cpupri;
}
+#ifdef CONFIG_CPUMASK_OFFSTACK
+static inline int alloc_vec_masks(struct cpupri_vec *vec)
+{
+ int i;
+
+ for (i = 0; i < nr_node_ids; i++) {
+ if (!zalloc_cpumask_var_node(&vec->masks[i], GFP_KERNEL, i))
+ goto cleanup;
+
+	/* Clear the bits of the current node, set all others */
+ bitmap_complement(cpumask_bits(vec->masks[i]),
+ cpumask_bits(cpumask_of_node(i)), small_cpumask_bits);
+ }
+ return 0;
+
+cleanup:
+ while (i--)
+ free_cpumask_var(vec->masks[i]);
+ return -ENOMEM;
+}
+
+static inline void free_vec_masks(struct cpupri_vec *vec)
+{
+ for (int i = 0; i < nr_node_ids; i++)
+ free_cpumask_var(vec->masks[i]);
+}
+
+static inline int setup_vec_mask_var_ts(struct cpupri *cp)
+{
+ int i;
+
+ for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+ vec->masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
+ if (!vec->masks)
+ goto cleanup;
+ }
+ return 0;
+
+cleanup:
+ /* Free any already allocated masks */
+ while (i--) {
+ kfree(cp->pri_to_cpu[i].masks);
+ cp->pri_to_cpu[i].masks = NULL;
+ }
+
+ return -ENOMEM;
+}
+
+static inline void free_vec_mask_var_ts(struct cpupri *cp)
+{
+ for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ kfree(cp->pri_to_cpu[i].masks);
+ cp->pri_to_cpu[i].masks = NULL;
+ }
+}
+
+static inline int
+available_cpu_in_nodes(struct task_struct *p, struct cpupri_vec *vec)
+{
+ int cur_node = numa_node_id();
+
+ for (int i = 0; i < nr_node_ids; i++) {
+ int nid = (cur_node + i) % nr_node_ids;
+
+ if (cpumask_first_and_and(&p->cpus_mask, vec->masks[nid],
+ cpumask_of_node(nid)) < nr_cpu_ids)
+ return 1;
+ }
+
+ return 0;
+}
+
+#define available_cpu_in_vec available_cpu_in_nodes
+
+#else /* !CONFIG_CPUMASK_OFFSTACK */
+
+static inline int alloc_vec_masks(struct cpupri_vec *vec)
+{
+ if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ return 0;
+}
+
+static inline void free_vec_masks(struct cpupri_vec *vec)
+{
+ free_cpumask_var(vec->mask);
+}
+
+static inline int setup_vec_mask_var_ts(struct cpupri *cp)
+{
+ return 0;
+}
+
+static inline void free_vec_mask_var_ts(struct cpupri *cp)
+{
+}
+
+static inline int
+available_cpu_in_vec(struct task_struct *p, struct cpupri_vec *vec)
+{
+ if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+ return 0;
+
+ return 1;
+}
+#endif
+
+static inline int alloc_all_masks(struct cpupri *cp)
+{
+ int i;
+
+ for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ if (alloc_vec_masks(&cp->pri_to_cpu[i]))
+ goto cleanup;
+ }
+
+ return 0;
+
+cleanup:
+ while (i--)
+ free_vec_masks(&cp->pri_to_cpu[i]);
+
+ return -ENOMEM;
+}
+
+static inline void setup_vec_counts(struct cpupri *cp)
+{
+ for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+ atomic_set(&vec->count, 0);
+ }
+}
+
static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
struct cpumask *lowest_mask, int idx)
{
@@ -96,11 +233,24 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
if (skip)
return 0;
- if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+ if (!available_cpu_in_vec(p, vec))
return 0;
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ struct cpumask *cpupri_mask = lowest_mask;
+
+	/* a CPU is available; build the combined mask if requested */
+ if (lowest_mask) {
+ cpumask_copy(cpupri_mask, vec->masks[0]);
+ for (int nid = 1; nid < nr_node_ids; nid++)
+ cpumask_and(cpupri_mask, cpupri_mask, vec->masks[nid]);
+ }
+#else
+ struct cpumask *cpupri_mask = vec->mask;
+#endif
+
if (lowest_mask) {
- cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+ cpumask_and(lowest_mask, &p->cpus_mask, cpupri_mask);
cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
/*
@@ -229,7 +379,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
if (likely(newpri != CPUPRI_INVALID)) {
struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ cpumask_set_cpu(cpu, vec->masks[cpu_to_node(cpu)]);
+#else
cpumask_set_cpu(cpu, vec->mask);
+#endif
/*
* When adding a new vector, we update the mask first,
* do a write memory barrier, and then update the count, to
@@ -263,7 +417,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
*/
atomic_dec(&(vec)->count);
smp_mb__after_atomic();
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ cpumask_clear_cpu(cpu, vec->masks[cpu_to_node(cpu)]);
+#else
cpumask_clear_cpu(cpu, vec->mask);
+#endif
}
*currpri = newpri;
@@ -279,26 +437,31 @@ int cpupri_init(struct cpupri *cp)
{
int i;
- for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
- struct cpupri_vec *vec = &cp->pri_to_cpu[i];
-
- atomic_set(&vec->count, 0);
- if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
- goto cleanup;
- }
-
+ /* Allocate the cpu_to_pri array */
cp->cpu_to_pri = kcalloc(nr_cpu_ids, sizeof(int), GFP_KERNEL);
if (!cp->cpu_to_pri)
- goto cleanup;
+ return -ENOMEM;
+ /* Initialize all CPUs to invalid priority */
for_each_possible_cpu(i)
cp->cpu_to_pri[i] = CPUPRI_INVALID;
+ /* Setup priority vectors */
+ setup_vec_counts(cp);
+ if (setup_vec_mask_var_ts(cp))
+ goto fail_setup_vectors;
+
+ /* Allocate masks for each priority vector */
+ if (alloc_all_masks(cp))
+ goto fail_alloc_masks;
+
return 0;
-cleanup:
- for (i--; i >= 0; i--)
- free_cpumask_var(cp->pri_to_cpu[i].mask);
+fail_alloc_masks:
+ free_vec_mask_var_ts(cp);
+
+fail_setup_vectors:
+ kfree(cp->cpu_to_pri);
return -ENOMEM;
}
@@ -308,9 +471,10 @@ int cpupri_init(struct cpupri *cp)
*/
void cpupri_cleanup(struct cpupri *cp)
{
- int i;
-
kfree(cp->cpu_to_pri);
- for (i = 0; i < CPUPRI_NR_PRIORITIES; i++)
- free_cpumask_var(cp->pri_to_cpu[i].mask);
+
+ for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++)
+ free_vec_masks(&cp->pri_to_cpu[i]);
+
+ free_vec_mask_var_ts(cp);
}
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index 245b0fa626be..c53f1f4dad86 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,11 @@
struct cpupri_vec {
atomic_t count;
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ cpumask_var_t *masks ____cacheline_aligned;
+#else
cpumask_var_t mask ____cacheline_aligned;
+#endif
};
struct cpupri {
--
2.43.5
* Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
@ 2025-07-07 6:53 ` kernel test robot
2025-07-07 11:36 ` Deng, Pan
2025-07-07 6:53 ` kernel test robot
2025-07-08 5:33 ` kernel test robot
2 siblings, 1 reply; 16+ messages in thread
From: kernel test robot @ 2025-07-07 6:53 UTC (permalink / raw)
To: Pan Deng, mingo
Cc: llvm, oe-kbuild-all, linux-kernel, tianyou.li, tim.c.chen,
yu.c.chen, pan.deng
Hi Pan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on v6.16-rc5]
[also build test WARNING on linus/master]
[cannot apply to tip/sched/core peterz-queue/sched/core tip/master tip/auto-latest next-20250704]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-131831
base: v6.16-rc5
patch link: https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.1751852370.git.pan.deng%40intel.com
patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20250707/202507071418.sFa0bilv-lkp@intel.com/config)
compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project 01c97b4953e87ae455bd4c41e3de3f0f0f29c61c)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250707/202507071418.sFa0bilv-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507071418.sFa0bilv-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from kernel/sched/build_policy.c:52:
kernel/sched/rt.c:496:21: error: unknown type name 'atomic_tp'; did you mean 'atomic_t'?
496 | int rto_counts_init(atomic_tp **rto_counts)
| ^~~~~~~~~
| atomic_t
include/linux/types.h:183:3: note: 'atomic_t' declared here
183 | } atomic_t;
| ^
In file included from kernel/sched/build_policy.c:52:
>> kernel/sched/rt.c:496:5: warning: no previous prototype for function 'rto_counts_init' [-Wmissing-prototypes]
496 | int rto_counts_init(atomic_tp **rto_counts)
| ^
kernel/sched/rt.c:496:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
496 | int rto_counts_init(atomic_tp **rto_counts)
| ^
| static
kernel/sched/rt.c:501:25: error: unknown type name 'atomic_tp'; did you mean 'atomic_t'?
501 | void rto_counts_cleanup(atomic_tp *rto_counts)
| ^~~~~~~~~
| atomic_t
include/linux/types.h:183:3: note: 'atomic_t' declared here
183 | } atomic_t;
| ^
In file included from kernel/sched/build_policy.c:52:
>> kernel/sched/rt.c:501:6: warning: no previous prototype for function 'rto_counts_cleanup' [-Wmissing-prototypes]
501 | void rto_counts_cleanup(atomic_tp *rto_counts)
| ^
kernel/sched/rt.c:501:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
501 | void rto_counts_cleanup(atomic_tp *rto_counts)
| ^
| static
2 warnings and 2 errors generated.
vim +/rto_counts_init +496 kernel/sched/rt.c
495
> 496 int rto_counts_init(atomic_tp **rto_counts)
497 {
498 return 0;
499 }
500
> 501 void rto_counts_cleanup(atomic_tp *rto_counts)
502 {
503 }
504
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2025-07-07 6:53 ` kernel test robot
@ 2025-07-07 6:53 ` kernel test robot
2025-07-08 5:33 ` kernel test robot
2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2025-07-07 6:53 UTC (permalink / raw)
To: Pan Deng, peterz, mingo
Cc: oe-kbuild-all, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen,
pan.deng
Hi Pan,
kernel test robot noticed the following build errors:
[auto build test ERROR on v6.16-rc5]
[also build test ERROR on linus/master]
[cannot apply to tip/sched/core peterz-queue/sched/core tip/master tip/auto-latest next-20250704]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-131831
base: v6.16-rc5
patch link: https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.1751852370.git.pan.deng%40intel.com
patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
config: arm-randconfig-002-20250707 (https://download.01.org/0day-ci/archive/20250707/202507071453.DYRB711b-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250707/202507071453.DYRB711b-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507071453.DYRB711b-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from kernel/sched/build_policy.c:52:
>> kernel/sched/rt.c:496:21: error: unknown type name 'atomic_tp'; did you mean 'atomic_t'?
496 | int rto_counts_init(atomic_tp **rto_counts)
| ^~~~~~~~~
| atomic_t
kernel/sched/rt.c:501:25: error: unknown type name 'atomic_tp'; did you mean 'atomic_t'?
501 | void rto_counts_cleanup(atomic_tp *rto_counts)
| ^~~~~~~~~
| atomic_t
vim +496 kernel/sched/rt.c
495
> 496 int rto_counts_init(atomic_tp **rto_counts)
497 {
498 return 0;
499 }
500
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* RE: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 6:53 ` kernel test robot
@ 2025-07-07 11:36 ` Deng, Pan
0 siblings, 0 replies; 16+ messages in thread
From: Deng, Pan @ 2025-07-07 11:36 UTC (permalink / raw)
To: lkp, mingo@kernel.org
Cc: llvm@lists.linux.dev, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, Li, Tianyou,
tim.c.chen@linux.intel.com, Chen, Yu C
The issue arises from redundant function definitions when CONFIG_SMP is disabled; it will be addressed, along with the other feedback, in the next version.
Best Regards
Pan
> -----Original Message-----
> From: lkp <lkp@intel.com>
> Sent: Monday, July 7, 2025 2:53 PM
> To: Deng, Pan <pan.deng@intel.com>; mingo@kernel.org
> Cc: llvm@lists.linux.dev; oe-kbuild-all@lists.linux.dev; linux-
> kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com; Chen, Yu C <yu.c.chen@intel.com>; Deng, Pan
> <pan.deng@intel.com>
> Subject: Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-
> node counters
>
> Hi Pan,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on v6.16-rc5]
> [also build test WARNING on linus/master] [cannot apply to tip/sched/core
> peterz-queue/sched/core tip/master tip/auto-latest next-20250704] [If your
> patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-
> Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-
> 131831
> base: v6.16-rc5
> patch link:
> https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.1751
> 852370.git.pan.deng%40intel.com
> patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-
> NUMA-node counters
> config: arm-allnoconfig (https://download.01.org/0day-
> ci/archive/20250707/202507071418.sFa0bilv-lkp@intel.com/config)
> compiler: clang version 21.0.0git (https://github.com/llvm/llvm-project
> 01c97b4953e87ae455bd4c41e3de3f0f0f29c61c)
> reproduce (this is a W=1 build): (https://download.01.org/0day-
> ci/archive/20250707/202507071418.sFa0bilv-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of the
> same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes:
> | https://lore.kernel.org/oe-kbuild-all/202507071418.sFa0bilv-lkp@intel.
> | com/
>
> All warnings (new ones prefixed by >>):
>
> In file included from kernel/sched/build_policy.c:52:
> kernel/sched/rt.c:496:21: error: unknown type name 'atomic_tp'; did you
> mean 'atomic_t'?
> 496 | int rto_counts_init(atomic_tp **rto_counts)
> | ^~~~~~~~~
> | atomic_t
> include/linux/types.h:183:3: note: 'atomic_t' declared here
> 183 | } atomic_t;
> | ^
> In file included from kernel/sched/build_policy.c:52:
> >> kernel/sched/rt.c:496:5: warning: no previous prototype for function
> >> 'rto_counts_init' [-Wmissing-prototypes]
> 496 | int rto_counts_init(atomic_tp **rto_counts)
> | ^
> kernel/sched/rt.c:496:1: note: declare 'static' if the function is not intended to
> be used outside of this translation unit
> 496 | int rto_counts_init(atomic_tp **rto_counts)
> | ^
> | static
> kernel/sched/rt.c:501:25: error: unknown type name 'atomic_tp'; did you
> mean 'atomic_t'?
> 501 | void rto_counts_cleanup(atomic_tp *rto_counts)
> | ^~~~~~~~~
> | atomic_t
> include/linux/types.h:183:3: note: 'atomic_t' declared here
> 183 | } atomic_t;
> | ^
> In file included from kernel/sched/build_policy.c:52:
> >> kernel/sched/rt.c:501:6: warning: no previous prototype for function
> >> 'rto_counts_cleanup' [-Wmissing-prototypes]
> 501 | void rto_counts_cleanup(atomic_tp *rto_counts)
> | ^
> kernel/sched/rt.c:501:1: note: declare 'static' if the function is not intended to
> be used outside of this translation unit
> 501 | void rto_counts_cleanup(atomic_tp *rto_counts)
> | ^
> | static
> 2 warnings and 2 errors generated.
>
>
> vim +/rto_counts_init +496 kernel/sched/rt.c
>
> 495
> > 496 int rto_counts_init(atomic_tp **rto_counts)
> 497 {
> 498 return 0;
> 499 }
> 500
> > 501 void rto_counts_cleanup(atomic_tp *rto_counts)
> 502 {
> 503 }
> 504
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2025-07-07 6:53 ` kernel test robot
2025-07-07 6:53 ` kernel test robot
@ 2025-07-08 5:33 ` kernel test robot
2025-07-08 14:02 ` Deng, Pan
2 siblings, 1 reply; 16+ messages in thread
From: kernel test robot @ 2025-07-08 5:33 UTC (permalink / raw)
To: Pan Deng, peterz, mingo
Cc: oe-kbuild-all, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen,
pan.deng
Hi Pan,
kernel test robot noticed the following build warnings:
[auto build test WARNING on v6.16-rc5]
[also build test WARNING on linus/master]
[cannot apply to tip/sched/core peterz-queue/sched/core tip/master tip/auto-latest next-20250704]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-131831
base: v6.16-rc5
patch link: https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.1751852370.git.pan.deng%40intel.com
patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
config: loongarch-randconfig-r112-20250708 (https://download.01.org/0day-ci/archive/20250708/202507081317.4IdE2euZ-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 15.1.0
reproduce: (https://download.01.org/0day-ci/archive/20250708/202507081317.4IdE2euZ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507081317.4IdE2euZ-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
kernel/sched/rt.c:1679:45: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/rt.c:1679:45: sparse: expected struct task_struct *p
kernel/sched/rt.c:1679:45: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/rt.c:1722:39: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *donor @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/rt.c:1722:39: sparse: expected struct task_struct *donor
kernel/sched/rt.c:1722:39: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/rt.c:1742:64: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/rt.c:1742:64: sparse: expected struct task_struct *tsk
kernel/sched/rt.c:1742:64: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/rt.c:2084:40: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *task @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/rt.c:2084:40: sparse: expected struct task_struct *task
kernel/sched/rt.c:2084:40: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/rt.c:2107:13: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/rt.c:2107:13: sparse: struct task_struct *
kernel/sched/rt.c:2107:13: sparse: struct task_struct [noderef] __rcu *
kernel/sched/rt.c:2453:54: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/rt.c:2453:54: sparse: expected struct task_struct *tsk
kernel/sched/rt.c:2453:54: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/rt.c:2455:40: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/rt.c:2455:40: sparse: expected struct task_struct *p
kernel/sched/rt.c:2455:40: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/rt.c:2455:62: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/rt.c:2455:62: sparse: expected struct task_struct *p
kernel/sched/rt.c:2455:62: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/build_policy.c: note: in included file:
kernel/sched/deadline.c:2717:23: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:2717:23: sparse: expected struct task_struct *p
kernel/sched/deadline.c:2717:23: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:2727:13: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/deadline.c:2727:13: sparse: struct task_struct *
kernel/sched/deadline.c:2727:13: sparse: struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:2833:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/deadline.c:2833:25: sparse: struct task_struct *
kernel/sched/deadline.c:2833:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:2357:42: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct sched_dl_entity const *b @@ got struct sched_dl_entity [noderef] __rcu * @@
kernel/sched/deadline.c:2357:42: sparse: expected struct sched_dl_entity const *b
kernel/sched/deadline.c:2357:42: sparse: got struct sched_dl_entity [noderef] __rcu *
kernel/sched/deadline.c:2368:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/deadline.c:2368:38: sparse: expected struct task_struct *tsk
kernel/sched/deadline.c:2368:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/deadline.c:1262:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/deadline.c:1262:39: sparse: expected struct task_struct *p
kernel/sched/deadline.c:1262:39: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/deadline.c:1262:85: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct sched_dl_entity const *b @@ got struct sched_dl_entity [noderef] __rcu * @@
kernel/sched/deadline.c:1262:85: sparse: expected struct sched_dl_entity const *b
kernel/sched/deadline.c:1262:85: sparse: got struct sched_dl_entity [noderef] __rcu *
kernel/sched/deadline.c:1362:23: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:1362:23: sparse: expected struct task_struct *p
kernel/sched/deadline.c:1362:23: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:1671:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/deadline.c:1671:31: sparse: expected struct task_struct *p
kernel/sched/deadline.c:1671:31: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/deadline.c:1671:70: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct sched_dl_entity const *b @@ got struct sched_dl_entity [noderef] __rcu * @@
kernel/sched/deadline.c:1671:70: sparse: expected struct sched_dl_entity const *b
kernel/sched/deadline.c:1671:70: sparse: got struct sched_dl_entity [noderef] __rcu *
kernel/sched/deadline.c:1760:39: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *donor @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:1760:39: sparse: expected struct task_struct *donor
kernel/sched/deadline.c:1760:39: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:2578:9: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sched_domain *[assigned] sd @@ got struct sched_domain [noderef] __rcu *parent @@
kernel/sched/deadline.c:2578:9: sparse: expected struct sched_domain *[assigned] sd
kernel/sched/deadline.c:2578:9: sparse: got struct sched_domain [noderef] __rcu *parent
kernel/sched/deadline.c:2242:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *curr @@ got struct task_struct [noderef] __rcu * @@
kernel/sched/deadline.c:2242:14: sparse: expected struct task_struct *curr
kernel/sched/deadline.c:2242:14: sparse: got struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:2243:15: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *donor @@ got struct task_struct [noderef] __rcu * @@
kernel/sched/deadline.c:2243:15: sparse: expected struct task_struct *donor
kernel/sched/deadline.c:2243:15: sparse: got struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:2318:43: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:2318:43: sparse: expected struct task_struct *p
kernel/sched/deadline.c:2318:43: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:2878:38: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/deadline.c:2878:38: sparse: expected struct task_struct *tsk
kernel/sched/deadline.c:2878:38: sparse: got struct task_struct [noderef] __rcu *curr
kernel/sched/deadline.c:2880:23: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/deadline.c:2880:23: sparse: expected struct task_struct *p
kernel/sched/deadline.c:2880:23: sparse: got struct task_struct [noderef] __rcu *donor
kernel/sched/deadline.c:2882:44: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct sched_dl_entity const *b @@ got struct sched_dl_entity [noderef] __rcu * @@
kernel/sched/deadline.c:2882:44: sparse: expected struct sched_dl_entity const *b
kernel/sched/deadline.c:2882:44: sparse: got struct sched_dl_entity [noderef] __rcu *
kernel/sched/deadline.c:3071:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/deadline.c:3071:23: sparse: struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:3071:23: sparse: struct task_struct *
kernel/sched/deadline.c:3120:32: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *curr @@
kernel/sched/build_policy.c: note: in included file:
kernel/sched/syscalls.c:206:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/syscalls.c:206:22: sparse: struct task_struct [noderef] __rcu *
kernel/sched/syscalls.c:206:22: sparse: struct task_struct *
kernel/sched/build_policy.c: note: in included file:
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/build_policy.c: note: in included file:
kernel/sched/rt.c:2413:45: sparse: sparse: dereference of noderef expression
kernel/sched/build_policy.c: note: in included file:
>> kernel/sched/sched.h:2627:35: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct task_struct *p @@ got struct task_struct [noderef] __rcu *donor @@
kernel/sched/build_policy.c: note: in included file:
kernel/sched/rt.c:2456:32: sparse: sparse: dereference of noderef expression
kernel/sched/rt.c:2457:32: sparse: sparse: dereference of noderef expression
kernel/sched/build_policy.c: note: in included file:
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2276:25: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/sched.h:2476:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2476:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2476:9: sparse: struct task_struct *
kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2287:26: sparse: struct task_struct *
kernel/sched/sched.h:2476:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2476:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2476:9: sparse: struct task_struct *
kernel/sched/build_policy.c: note: in included file:
kernel/sched/syscalls.c:1296:6: sparse: sparse: context imbalance in 'sched_getaffinity' - different lock contexts for basic block
kernel/sched/build_policy.c: note: in included file:
kernel/sched/rt.c:1767:15: sparse: sparse: dereference of noderef expression
vim +2627 kernel/sched/sched.h
04746ed80bcf31 Ingo Molnar 2024-04-07 2624
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2625 static inline struct task_struct *get_push_task(struct rq *rq)
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2626 {
af0c8b2bf67b25 Peter Zijlstra 2024-10-09 @2627 struct task_struct *p = rq->donor;
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2628
5cb9eaa3d274f7 Peter Zijlstra 2020-11-17 2629 lockdep_assert_rq_held(rq);
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2630
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2631 if (rq->push_busy)
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2632 return NULL;
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2633
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2634 if (p->nr_cpus_allowed == 1)
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2635 return NULL;
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2636
e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2637 if (p->migration_disabled)
e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2638 return NULL;
e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2639
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2640 rq->push_busy = true;
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2641 return get_task_struct(p);
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2642 }
a7c81556ec4d34 Peter Zijlstra 2020-09-28 2643
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 16+ messages in thread
* RE: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-08 5:33 ` kernel test robot
@ 2025-07-08 14:02 ` Deng, Pan
2025-07-09 8:56 ` Li, Philip
0 siblings, 1 reply; 16+ messages in thread
From: Deng, Pan @ 2025-07-08 14:02 UTC (permalink / raw)
To: lkp, peterz@infradead.org, mingo@kernel.org
Cc: oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
Li, Tianyou, tim.c.chen@linux.intel.com, Chen, Yu C
> -----Original Message-----
> From: lkp <lkp@intel.com>
> Sent: Tuesday, July 8, 2025 1:34 PM
> To: Deng, Pan <pan.deng@intel.com>; peterz@infradead.org; mingo@kernel.org
> Cc: oe-kbuild-all@lists.linux.dev; linux-kernel@vger.kernel.org; Li, Tianyou
> <tianyou.li@intel.com>; tim.c.chen@linux.intel.com; Chen, Yu C
> <yu.c.chen@intel.com>; Deng, Pan <pan.deng@intel.com>
> Subject: Re: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-
> node counters
>
> Hi Pan,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on v6.16-rc5]
> [also build test WARNING on linus/master] [cannot apply to tip/sched/core
> peterz-queue/sched/core tip/master tip/auto-latest next-20250704] [If your
> patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Pan-Deng/sched-rt-
> Optimize-cpupri_vec-layout-to-mitigate-cache-line-contention/20250707-131831
> base: v6.16-rc5
> patch link:
> https://lore.kernel.org/r/2c1e1dbacaddd881f3cca340ece1f9268029b620.175185
> 2370.git.pan.deng%40intel.com
> patch subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-
> node counters
> config: loongarch-randconfig-r112-20250708 (https://download.01.org/0day-
> ci/archive/20250708/202507081317.4IdE2euZ-lkp@intel.com/config)
> compiler: loongarch64-linux-gcc (GCC) 15.1.0
> reproduce: (https://download.01.org/0day-
> ci/archive/20250708/202507081317.4IdE2euZ-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of the
> same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes:
> | https://lore.kernel.org/oe-kbuild-all/202507081317.4IdE2euZ-lkp@intel.
> | com/
>
> sparse warnings: (new ones prefixed by >>)
> kernel/sched/rt.c:1679:45: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/rt.c:1679:45: sparse: expected struct task_struct *p
> kernel/sched/rt.c:1679:45: sparse: got struct task_struct [noderef] __rcu
> *donor
> kernel/sched/rt.c:1722:39: sparse: sparse: incorrect type in initializer (different
> address spaces) @@ expected struct task_struct *donor @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/rt.c:1722:39: sparse: expected struct task_struct *donor
> kernel/sched/rt.c:1722:39: sparse: got struct task_struct [noderef] __rcu
> *donor
> kernel/sched/rt.c:1742:64: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *tsk @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/rt.c:1742:64: sparse: expected struct task_struct *tsk
> kernel/sched/rt.c:1742:64: sparse: got struct task_struct [noderef] __rcu
> *curr
> kernel/sched/rt.c:2084:40: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *task @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/rt.c:2084:40: sparse: expected struct task_struct *task
> kernel/sched/rt.c:2084:40: sparse: got struct task_struct [noderef] __rcu
> *curr
> kernel/sched/rt.c:2107:13: sparse: sparse: incompatible types in comparison
> expression (different address spaces):
> kernel/sched/rt.c:2107:13: sparse: struct task_struct *
> kernel/sched/rt.c:2107:13: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/rt.c:2453:54: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *tsk @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/rt.c:2453:54: sparse: expected struct task_struct *tsk
> kernel/sched/rt.c:2453:54: sparse: got struct task_struct [noderef] __rcu
> *curr
> kernel/sched/rt.c:2455:40: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/rt.c:2455:40: sparse: expected struct task_struct *p
> kernel/sched/rt.c:2455:40: sparse: got struct task_struct [noderef] __rcu
> *donor
> kernel/sched/rt.c:2455:62: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/rt.c:2455:62: sparse: expected struct task_struct *p
> kernel/sched/rt.c:2455:62: sparse: got struct task_struct [noderef] __rcu
> *donor
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/deadline.c:2717:23: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:2717:23: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:2717:23: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:2727:13: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/deadline.c:2727:13: sparse: struct task_struct *
> kernel/sched/deadline.c:2727:13: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/deadline.c:2833:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/deadline.c:2833:25: sparse: struct task_struct *
> kernel/sched/deadline.c:2833:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/deadline.c:2357:42: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct sched_dl_entity const *b @@
> got struct sched_dl_entity [noderef] __rcu * @@
> kernel/sched/deadline.c:2357:42: sparse: expected struct sched_dl_entity
> const *b
> kernel/sched/deadline.c:2357:42: sparse: got struct sched_dl_entity
> [noderef] __rcu *
> kernel/sched/deadline.c:2368:38: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *tsk @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/deadline.c:2368:38: sparse: expected struct task_struct *tsk
> kernel/sched/deadline.c:2368:38: sparse: got struct task_struct [noderef]
> __rcu *curr
> kernel/sched/deadline.c:1262:39: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *curr @@
> kernel/sched/deadline.c:1262:39: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:1262:39: sparse: got struct task_struct [noderef]
> __rcu *curr
> kernel/sched/deadline.c:1262:85: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct sched_dl_entity const *b @@
> got struct sched_dl_entity [noderef] __rcu * @@
> kernel/sched/deadline.c:1262:85: sparse: expected struct sched_dl_entity
> const *b
> kernel/sched/deadline.c:1262:85: sparse: got struct sched_dl_entity
> [noderef] __rcu *
> kernel/sched/deadline.c:1362:23: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:1362:23: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:1362:23: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:1671:31: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *curr @@
> kernel/sched/deadline.c:1671:31: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:1671:31: sparse: got struct task_struct [noderef]
> __rcu *curr
> kernel/sched/deadline.c:1671:70: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct sched_dl_entity const *b @@
> got struct sched_dl_entity [noderef] __rcu * @@
> kernel/sched/deadline.c:1671:70: sparse: expected struct sched_dl_entity
> const *b
> kernel/sched/deadline.c:1671:70: sparse: got struct sched_dl_entity
> [noderef] __rcu *
> kernel/sched/deadline.c:1760:39: sparse: sparse: incorrect type in initializer
> (different address spaces) @@ expected struct task_struct *donor @@ got
> struct task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:1760:39: sparse: expected struct task_struct *donor
> kernel/sched/deadline.c:1760:39: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:2578:9: sparse: sparse: incorrect type in assignment
> (different address spaces) @@ expected struct sched_domain *[assigned] sd
> @@ got struct sched_domain [noderef] __rcu *parent @@
> kernel/sched/deadline.c:2578:9: sparse: expected struct sched_domain
> *[assigned] sd
> kernel/sched/deadline.c:2578:9: sparse: got struct sched_domain [noderef]
> __rcu *parent
> kernel/sched/deadline.c:2242:14: sparse: sparse: incorrect type in assignment
> (different address spaces) @@ expected struct task_struct *curr @@ got
> struct task_struct [noderef] __rcu * @@
> kernel/sched/deadline.c:2242:14: sparse: expected struct task_struct *curr
> kernel/sched/deadline.c:2242:14: sparse: got struct task_struct [noderef]
> __rcu *
> kernel/sched/deadline.c:2243:15: sparse: sparse: incorrect type in assignment
> (different address spaces) @@ expected struct task_struct *donor @@ got
> struct task_struct [noderef] __rcu * @@
> kernel/sched/deadline.c:2243:15: sparse: expected struct task_struct *donor
> kernel/sched/deadline.c:2243:15: sparse: got struct task_struct [noderef]
> __rcu *
> kernel/sched/deadline.c:2318:43: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:2318:43: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:2318:43: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:2878:38: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *tsk @@ got
> struct task_struct [noderef] __rcu *curr @@
> kernel/sched/deadline.c:2878:38: sparse: expected struct task_struct *tsk
> kernel/sched/deadline.c:2878:38: sparse: got struct task_struct [noderef]
> __rcu *curr
> kernel/sched/deadline.c:2880:23: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/deadline.c:2880:23: sparse: expected struct task_struct *p
> kernel/sched/deadline.c:2880:23: sparse: got struct task_struct [noderef]
> __rcu *donor
> kernel/sched/deadline.c:2882:44: sparse: sparse: incorrect type in argument 2
> (different address spaces) @@ expected struct sched_dl_entity const *b @@
> got struct sched_dl_entity [noderef] __rcu * @@
> kernel/sched/deadline.c:2882:44: sparse: expected struct sched_dl_entity
> const *b
> kernel/sched/deadline.c:2882:44: sparse: got struct sched_dl_entity
> [noderef] __rcu *
> kernel/sched/deadline.c:3071:23: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/deadline.c:3071:23: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/deadline.c:3071:23: sparse: struct task_struct *
> kernel/sched/deadline.c:3120:32: sparse: sparse: incorrect type in argument 1
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *curr @@
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/syscalls.c:206:22: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/syscalls.c:206:22: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/syscalls.c:206:22: sparse: struct task_struct *
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/rt.c:2413:45: sparse: sparse: dereference of noderef expression
> kernel/sched/build_policy.c: note: in included file:
> >> kernel/sched/sched.h:2627:35: sparse: sparse: incorrect type in initializer
This warning does not appear to be related to the change we made. @lkp, could you please check it?
> (different address spaces) @@ expected struct task_struct *p @@ got struct
> task_struct [noderef] __rcu *donor @@
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/rt.c:2456:32: sparse: sparse: dereference of noderef expression
> kernel/sched/rt.c:2457:32: sparse: sparse: dereference of noderef expression
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/sched.h:2476:9: sparse: sparse: incompatible types in comparison
> expression (different address spaces):
> kernel/sched/sched.h:2476:9: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2476:9: sparse: struct task_struct *
> kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> comparison expression (different address spaces):
> kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> kernel/sched/sched.h:2476:9: sparse: sparse: incompatible types in comparison
> expression (different address spaces):
> kernel/sched/sched.h:2476:9: sparse: struct task_struct [noderef] __rcu *
> kernel/sched/sched.h:2476:9: sparse: struct task_struct *
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/syscalls.c:1296:6: sparse: sparse: context imbalance in
> 'sched_getaffinity' - different lock contexts for basic block
> kernel/sched/build_policy.c: note: in included file:
> kernel/sched/rt.c:1767:15: sparse: sparse: dereference of noderef expression
>
> vim +2627 kernel/sched/sched.h
>
> 04746ed80bcf31 Ingo Molnar 2024-04-07 2624
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2625 static inline struct
> task_struct *get_push_task(struct rq *rq)
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2626 {
> af0c8b2bf67b25 Peter Zijlstra 2024-10-09 @2627 struct task_struct *p =
> rq->donor;
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2628
> 5cb9eaa3d274f7 Peter Zijlstra 2020-11-17 2629
> lockdep_assert_rq_held(rq);
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2630
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2631 if (rq->push_busy)
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2632 return NULL;
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2633
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2634 if (p->nr_cpus_allowed
> == 1)
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2635 return NULL;
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2636
> e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2637 if (p-
> >migration_disabled)
> e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2638 return
> NULL;
> e681dcbaa4b284 Sebastian Andrzej Siewior 2021-08-26 2639
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2640 rq->push_busy = true;
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2641 return
> get_task_struct(p);
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2642 }
> a7c81556ec4d34 Peter Zijlstra 2020-09-28 2643
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
* RE: [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
2025-07-08 14:02 ` Deng, Pan
@ 2025-07-09 8:56 ` Li, Philip
0 siblings, 0 replies; 16+ messages in thread
From: Li, Philip @ 2025-07-09 8:56 UTC (permalink / raw)
To: Deng, Pan, lkp, peterz@infradead.org, mingo@kernel.org
Cc: oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
Li, Tianyou, tim.c.chen@linux.intel.com, Chen, Yu C
> > comparison expression (different address spaces):
> > kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> > kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> > kernel/sched/sched.h:2276:25: sparse: sparse: incompatible types in
> > comparison expression (different address spaces):
> > kernel/sched/sched.h:2276:25: sparse: struct task_struct [noderef] __rcu *
> > kernel/sched/sched.h:2276:25: sparse: struct task_struct *
> > kernel/sched/sched.h:2287:26: sparse: sparse: incompatible types in
> > comparison expression (different address spaces):
> > kernel/sched/sched.h:2287:26: sparse: struct task_struct [noderef] __rcu *
> > kernel/sched/sched.h:2287:26: sparse: struct task_struct *
> > kernel/sched/build_policy.c: note: in included file:
> > kernel/sched/rt.c:2413:45: sparse: sparse: dereference of noderef expression
> > kernel/sched/build_policy.c: note: in included file:
> > >> kernel/sched/sched.h:2627:35: sparse: sparse: incorrect type in initializer
> This warning is not about the change we made, @lkp, could you please check it?
Sorry for this false report; it should not be related to your changes. We will follow
up to figure out what went wrong during the bisection. Sorry for wasting your time.
* Re: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
2025-07-07 2:35 ` [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
@ 2025-07-21 11:23 ` Chen, Yu C
2025-07-22 14:46 ` Deng, Pan
0 siblings, 1 reply; 16+ messages in thread
From: Chen, Yu C @ 2025-07-21 11:23 UTC (permalink / raw)
To: Pan Deng; +Cc: linux-kernel, tianyou.li, tim.c.chen, peterz, mingo
On 7/7/2025 10:35 AM, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on HCC system, significant
> contention is observed on bitmap of `cpupri_vec->cpumask`.
>
> The SUT is a 2-socket machine with 240 physical cores and 480 logical
> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS is used as score.
>
> perf c2c tool reveals:
> cpumask (bitmap) cache line of `cpupri_vec->mask`:
> - bits are loaded during cpupri_find
> - bits are stored during cpupri_set
> - cycles per load: ~2.2K to 8.7K
>
> This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> mitigate false sharing.
>
> As a result:
> - FPS improves by ~3.8%
> - Kernel cycles% drops from ~20% to ~18.7%
> - Cache line contention is mitigated, perf-c2c shows cycles per load
> drops from ~2.2K-8.7K to ~0.5K-2.2K
>
This brings a noticeable improvement for the RT workload, and it would
be even more convincing if we could also try a normal task workload,
at least to confirm it brings no regressions (schbench/hackbench, etc).
thanks,
Chenyu
> Note: CONFIG_CPUMASK_OFFSTACK=n remains unchanged.
>
* RE: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
2025-07-21 11:23 ` Chen, Yu C
@ 2025-07-22 14:46 ` Deng, Pan
2025-08-06 14:00 ` Deng, Pan
0 siblings, 1 reply; 16+ messages in thread
From: Deng, Pan @ 2025-07-22 14:46 UTC (permalink / raw)
To: Chen, Yu C
Cc: linux-kernel@vger.kernel.org, Li, Tianyou,
tim.c.chen@linux.intel.com, peterz@infradead.org,
mingo@kernel.org
> -----Original Message-----
> From: Chen, Yu C <yu.c.chen@intel.com>
> Sent: Monday, July 21, 2025 7:24 PM
> To: Deng, Pan <pan.deng@intel.com>
> Cc: linux-kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com; peterz@infradead.org; mingo@kernel.org
> Subject: Re: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA
> node to reduce contention
>
> On 7/7/2025 10:35 AM, Pan Deng wrote:
> > When running a multi-instance FFmpeg workload on HCC system,
> > significant contention is observed on bitmap of `cpupri_vec->cpumask`.
> >
> > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical
> > cores
> > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > with FIFO scheduling. FPS is used as score.
> >
> > perf c2c tool reveals:
> > cpumask (bitmap) cache line of `cpupri_vec->mask`:
> > - bits are loaded during cpupri_find
> > - bits are stored during cpupri_set
> > - cycles per load: ~2.2K to 8.7K
> >
> > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> > mitigate false sharing.
> >
> > As a result:
> > - FPS improves by ~3.8%
> > - Kernel cycles% drops from ~20% to ~18.7%
> > - Cache line contention is mitigated, perf-c2c shows cycles per load
> > drops from ~2.2K-8.7K to ~0.5K-2.2K
> >
>
> This brings noticeable improvement for RT workload, and it would be even
> more convincing if we can have try on normal task workload, at least not bring
> regression(schbench/hackbench, etc).
>
Thanks Yu, hackbench and schbench data will be provided later.
> thanks,
> Chenyu
>
> > Note: CONFIG_CPUMASK_OFFSTACK=n remains unchanged.
> >
>
* RE: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
2025-07-22 14:46 ` Deng, Pan
@ 2025-08-06 14:00 ` Deng, Pan
0 siblings, 0 replies; 16+ messages in thread
From: Deng, Pan @ 2025-08-06 14:00 UTC (permalink / raw)
To: Chen, Yu C
Cc: linux-kernel@vger.kernel.org, Li, Tianyou,
tim.c.chen@linux.intel.com, peterz@infradead.org,
mingo@kernel.org
> -----Original Message-----
> From: Deng, Pan
> Sent: Tuesday, July 22, 2025 10:47 PM
> To: Chen, Yu C <yu.c.chen@intel.com>
> Cc: linux-kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com; peterz@infradead.org; mingo@kernel.org
> Subject: RE: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA
> node to reduce contention
>
>
> > -----Original Message-----
> > From: Chen, Yu C <yu.c.chen@intel.com>
> > Sent: Monday, July 21, 2025 7:24 PM
> > To: Deng, Pan <pan.deng@intel.com>
> > Cc: linux-kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> > tim.c.chen@linux.intel.com; peterz@infradead.org; mingo@kernel.org
> > Subject: Re: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA
> > node to reduce contention
> >
> > On 7/7/2025 10:35 AM, Pan Deng wrote:
> > > When running a multi-instance FFmpeg workload on HCC system,
> > > significant contention is observed on bitmap of `cpupri_vec->cpumask`.
> > >
> > > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> > > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical
> > > cores
> > > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > > with FIFO scheduling. FPS is used as score.
> > >
> > > perf c2c tool reveals:
> > > cpumask (bitmap) cache line of `cpupri_vec->mask`:
> > > - bits are loaded during cpupri_find
> > > - bits are stored during cpupri_set
> > > - cycles per load: ~2.2K to 8.7K
> > >
> > > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> > > mitigate false sharing.
> > >
> > > As a result:
> > > - FPS improves by ~3.8%
> > > - Kernel cycles% drops from ~20% to ~18.7%
> > > - Cache line contention is mitigated, perf-c2c shows cycles per load
> > > drops from ~2.2K-8.7K to ~0.5K-2.2K
> > >
> >
> > This brings noticeable improvement for RT workload, and it would be even
> > more convincing if we can have try on normal task workload, at least not
> bring
> > regression(schbench/hackbench, etc).
> >
>
> Thanks Yu, hackbench and schbench data will be provided later.
>
>
TLDR;
====
Hackbench and both the old and new versions of schbench were evaluated on a
SUT with 2 sockets/6 NUMA nodes/240 physical cores/480 logical CPUs. No
regressions were detected for patches 1-4. In addition, symbol-level analysis
of `perf record -a` profiling data indicates that the changes introduced
in patches 1-4 are unlikely to cause regressions in hackbench or schbench.
Details
=======
Hackbench
=========
The workloads were run by the test framework
https://github.com/yu-chen-surf/schedtests, with the following procedure:
1. Reboot the system to run a workload.
2. Run 5 iterations of the 1st configuration with a 30s cool-down period.
3. Run 5 iterations of the 2nd configuration.
...
The test results are as follows; regressions exceeding -10% are marked
with ** at the end of the line. However, when the tests were re-run using
the test framework or a vanilla workload, the regressions could not be
reproduced.
Note: 15/30/45 are the number of fds, i.e. process/thread pairs, in one group.
Patch 1
case load baseline(std%) patch1%( std%)
process-pipe-15 1-groups 1.00 ( 14.03) -8.81 ( 6.53)
process-pipe-15 2-groups 1.00 ( 3.46) +1.82 ( 2.59)
process-pipe-15 4-groups 1.00 ( 6.20) +8.60 ( 5.59)
process-pipe-15 8-groups 1.00 ( 2.41) -0.21 ( 3.22)
process-pipe-30 1-groups 1.00 ( 2.51) +2.24 ( 3.12)
process-pipe-30 2-groups 1.00 ( 3.86) -0.58 ( 2.46)
process-pipe-30 4-groups 1.00 ( 2.19) -1.81 ( 1.05)
process-pipe-30 8-groups 1.00 ( 1.69) +0.52 ( 3.01)
process-pipe-45 1-groups 1.00 ( 1.63) +1.63 ( 1.23)
process-pipe-45 2-groups 1.00 ( 0.79) +0.08 ( 1.82)
process-pipe-45 4-groups 1.00 ( 1.62) -0.06 ( 0.63)
process-pipe-45 8-groups 1.00 ( 1.66) -4.12 ( 3.27)
process-sockets-15 1-groups 1.00 ( 3.57) +2.36 ( 5.15)
process-sockets-15 2-groups 1.00 ( 3.59) -1.33 ( 6.86)
process-sockets-15 4-groups 1.00 ( 7.10) +5.44 ( 6.97)
process-sockets-15 8-groups 1.00 ( 2.63) -3.05 ( 1.94)
process-sockets-30 1-groups 1.00 ( 3.73) -2.69 ( 4.89)
process-sockets-30 2-groups 1.00 ( 3.90) -4.25 ( 3.94)
process-sockets-30 4-groups 1.00 ( 1.03) -1.58 ( 1.51)
process-sockets-30 8-groups 1.00 ( 0.48) +1.09 ( 0.68)
process-sockets-45 1-groups 1.00 ( 0.62) -2.25 ( 0.57)
process-sockets-45 2-groups 1.00 ( 2.56) -0.61 ( 0.63)
process-sockets-45 4-groups 1.00 ( 0.57) -0.51 ( 0.79)
process-sockets-45 8-groups 1.00 ( 0.18) -5.23 ( 2.18)
threads-pipe-15 1-groups 1.00 ( 5.30) -1.47 ( 5.38)
threads-pipe-15 2-groups 1.00 ( 7.97) -1.31 ( 8.61)
threads-pipe-15 4-groups 1.00 ( 4.94) -3.31 ( 5.48)
threads-pipe-15 8-groups 1.00 ( 1.69) +7.28 ( 5.54)
threads-pipe-30 1-groups 1.00 ( 5.12) -1.58 ( 4.82)
threads-pipe-30 2-groups 1.00 ( 1.63) +3.29 ( 1.72)
threads-pipe-30 4-groups 1.00 ( 3.41) +3.05 ( 3.22)
threads-pipe-30 8-groups 1.00 ( 2.85) +1.58 ( 4.05)
threads-pipe-45 1-groups 1.00 ( 5.13) -0.78 ( 6.78)
threads-pipe-45 2-groups 1.00 ( 1.92) -2.87 ( 1.27)
threads-pipe-45 4-groups 1.00 ( 2.41) -4.37 ( 1.23)
threads-pipe-45 8-groups 1.00 ( 1.81) +1.85 ( 1.54)
threads-sockets-15 1-groups 1.00 ( 4.72) -0.73 ( 2.75)
threads-sockets-15 2-groups 1.00 ( 3.05) +3.09 ( 3.39)
threads-sockets-15 4-groups 1.00 ( 5.92) +0.87 ( 2.25)
threads-sockets-15 8-groups 1.00 ( 3.75) -7.24 ( 3.34)
threads-sockets-30 1-groups 1.00 ( 5.96) -6.27 ( 3.35)
threads-sockets-30 2-groups 1.00 ( 1.68) -1.78 ( 3.60)
threads-sockets-30 4-groups 1.00 ( 5.02) -0.95 ( 3.60)
threads-sockets-30 8-groups 1.00 ( 0.41) -3.09 ( 2.03)
threads-sockets-45 1-groups 1.00 ( 2.55) -1.32 ( 1.37)
threads-sockets-45 2-groups 1.00 ( 3.53) -0.46 ( 3.99)
threads-sockets-45 4-groups 1.00 ( 0.51) +0.67 ( 0.74)
threads-sockets-45 8-groups 1.00 ( 3.01) -16.85 ( 2.13) **
Patch 2
case load baseline(std%) patch2%( std%)
process-pipe-15 1-groups 1.00 ( 14.03) -3.32 ( 11.34)
process-pipe-15 2-groups 1.00 ( 3.46) +2.19 ( 7.27)
process-pipe-15 4-groups 1.00 ( 6.20) +2.01 ( 2.83)
process-pipe-15 8-groups 1.00 ( 2.41) +1.65 ( 4.39)
process-pipe-30 1-groups 1.00 ( 2.51) -0.88 ( 3.26)
process-pipe-30 2-groups 1.00 ( 3.86) +2.25 ( 3.21)
process-pipe-30 4-groups 1.00 ( 2.19) +0.20 ( 1.72)
process-pipe-30 8-groups 1.00 ( 1.69) +0.85 ( 0.61)
process-pipe-45 1-groups 1.00 ( 1.63) +3.10 ( 4.01)
process-pipe-45 2-groups 1.00 ( 0.79) -1.00 ( 0.69)
process-pipe-45 4-groups 1.00 ( 1.62) +0.07 ( 0.63)
process-pipe-45 8-groups 1.00 ( 1.66) +0.20 ( 1.47)
process-sockets-15 1-groups 1.00 ( 3.57) -5.44 ( 3.45)
process-sockets-15 2-groups 1.00 ( 3.59) +1.00 ( 4.35)
process-sockets-15 4-groups 1.00 ( 7.10) +0.46 ( 4.45)
process-sockets-15 8-groups 1.00 ( 2.63) -1.48 ( 4.56)
process-sockets-30 1-groups 1.00 ( 3.73) -0.17 ( 3.57)
process-sockets-30 2-groups 1.00 ( 3.90) +3.83 ( 7.54)
process-sockets-30 4-groups 1.00 ( 1.03) -2.35 ( 6.11)
process-sockets-30 8-groups 1.00 ( 0.48) -0.43 ( 0.79)
process-sockets-45 1-groups 1.00 ( 0.62) -2.24 ( 1.63)
process-sockets-45 2-groups 1.00 ( 2.56) -1.41 ( 3.17)
process-sockets-45 4-groups 1.00 ( 0.57) -0.29 ( 0.33)
process-sockets-45 8-groups 1.00 ( 0.18) -6.05 ( 1.55)
threads-pipe-15 1-groups 1.00 ( 5.30) -5.83 ( 7.96)
threads-pipe-15 2-groups 1.00 ( 7.97) -3.74 ( 4.22)
threads-pipe-15 4-groups 1.00 ( 4.94) -2.23 ( 5.75)
threads-pipe-15 8-groups 1.00 ( 1.69) +0.21 ( 3.08)
threads-pipe-30 1-groups 1.00 ( 5.12) -5.73 ( 4.97)
threads-pipe-30 2-groups 1.00 ( 1.63) -1.76 ( 4.49)
threads-pipe-30 4-groups 1.00 ( 3.41) -0.99 ( 2.50)
threads-pipe-30 8-groups 1.00 ( 2.85) +0.71 ( 1.04)
threads-pipe-45 1-groups 1.00 ( 5.13) +0.08 ( 5.72)
threads-pipe-45 2-groups 1.00 ( 1.92) -1.78 ( 1.30)
threads-pipe-45 4-groups 1.00 ( 2.41) -3.79 ( 0.81)
threads-pipe-45 8-groups 1.00 ( 1.81) -3.62 ( 1.41)
threads-sockets-15 1-groups 1.00 ( 4.72) +2.52 ( 2.66)
threads-sockets-15 2-groups 1.00 ( 3.05) -7.59 ( 1.80)
threads-sockets-15 4-groups 1.00 ( 5.92) +1.59 ( 7.12)
threads-sockets-15 8-groups 1.00 ( 3.75) -0.34 ( 3.62)
threads-sockets-30 1-groups 1.00 ( 5.96) -2.45 ( 4.89)
threads-sockets-30 2-groups 1.00 ( 1.68) -0.61 ( 4.80)
threads-sockets-30 4-groups 1.00 ( 5.02) -2.15 ( 8.62)
threads-sockets-30 8-groups 1.00 ( 0.41) -17.32 ( 0.88) **
threads-sockets-45 1-groups 1.00 ( 2.55) -3.24 ( 3.37)
threads-sockets-45 2-groups 1.00 ( 3.53) -1.38 ( 2.40)
threads-sockets-45 4-groups 1.00 ( 0.51) -0.17 ( 0.85)
threads-sockets-45 8-groups 1.00 ( 3.01) -14.59 ( 5.48) **
Patch 3
case load baseline(std%) patch3%( std%)
process-pipe-15 1-groups 1.00 ( 14.03) -10.18 ( 3.39) **
process-pipe-15 2-groups 1.00 ( 3.46) +5.18 ( 3.12)
process-pipe-15 4-groups 1.00 ( 6.20) +8.63 ( 5.72)
process-pipe-15 8-groups 1.00 ( 2.41) +5.37 ( 2.24)
process-pipe-30 1-groups 1.00 ( 2.51) +5.53 ( 3.55)
process-pipe-30 2-groups 1.00 ( 3.86) +5.70 ( 4.27)
process-pipe-30 4-groups 1.00 ( 2.19) +3.95 ( 3.34)
process-pipe-30 8-groups 1.00 ( 1.69) -3.38 ( 1.51)
process-pipe-45 1-groups 1.00 ( 1.63) +5.19 ( 2.51)
process-pipe-45 2-groups 1.00 ( 0.79) -0.63 ( 2.06)
process-pipe-45 4-groups 1.00 ( 1.62) -5.83 ( 2.22)
process-pipe-45 8-groups 1.00 ( 1.66) -6.13 ( 2.34)
process-sockets-15 1-groups 1.00 ( 3.57) -1.51 ( 4.21)
process-sockets-15 2-groups 1.00 ( 3.59) -1.30 ( 7.50)
process-sockets-15 4-groups 1.00 ( 7.10) -1.80 ( 5.58)
process-sockets-15 8-groups 1.00 ( 2.63) -1.68 ( 3.40)
process-sockets-30 1-groups 1.00 ( 3.73) -7.74 ( 1.58)
process-sockets-30 2-groups 1.00 ( 3.90) -1.98 ( 5.48)
process-sockets-30 4-groups 1.00 ( 1.03) -0.33 ( 3.47)
process-sockets-30 8-groups 1.00 ( 0.48) -0.40 ( 0.84)
process-sockets-45 1-groups 1.00 ( 0.62) -0.21 ( 0.54)
process-sockets-45 2-groups 1.00 ( 2.56) -1.97 ( 2.48)
process-sockets-45 4-groups 1.00 ( 0.57) -0.61 ( 0.83)
process-sockets-45 8-groups 1.00 ( 0.18) -5.09 ( 1.85)
threads-pipe-15 1-groups 1.00 ( 5.30) +3.62 ( 11.04)
threads-pipe-15 2-groups 1.00 ( 7.97) +8.08 ( 4.63)
threads-pipe-15 4-groups 1.00 ( 4.94) +6.46 ( 5.27)
threads-pipe-15 8-groups 1.00 ( 1.69) +2.68 ( 3.23)
threads-pipe-30 1-groups 1.00 ( 5.12) +3.60 ( 7.09)
threads-pipe-30 2-groups 1.00 ( 1.63) -0.80 ( 4.43)
threads-pipe-30 4-groups 1.00 ( 3.41) +2.37 ( 2.16)
threads-pipe-30 8-groups 1.00 ( 2.85) +4.17 ( 1.41)
threads-pipe-45 1-groups 1.00 ( 5.13) +7.41 ( 4.48)
threads-pipe-45 2-groups 1.00 ( 1.92) -1.40 ( 2.69)
threads-pipe-45 4-groups 1.00 ( 2.41) -1.25 ( 2.15)
threads-pipe-45 8-groups 1.00 ( 1.81) +1.62 ( 0.73)
threads-sockets-15 1-groups 1.00 ( 4.72) +10.11 ( 7.95)
threads-sockets-15 2-groups 1.00 ( 3.05) -8.41 ( 5.93)
threads-sockets-15 4-groups 1.00 ( 5.92) -10.89 ( 4.29) **
threads-sockets-15 8-groups 1.00 ( 3.75) -7.66 ( 3.33)
threads-sockets-30 1-groups 1.00 ( 5.96) -5.18 ( 2.77)
threads-sockets-30 2-groups 1.00 ( 1.68) -4.91 ( 3.89)
threads-sockets-30 4-groups 1.00 ( 5.02) -6.32 ( 4.19)
threads-sockets-30 8-groups 1.00 ( 0.41) -11.73 ( 0.96) **
threads-sockets-45 1-groups 1.00 ( 2.55) -3.16 ( 1.97)
threads-sockets-45 2-groups 1.00 ( 3.53) -0.21 ( 4.33)
threads-sockets-45 4-groups 1.00 ( 0.51) -0.75 ( 2.07)
threads-sockets-45 8-groups 1.00 ( 3.01) -20.52 ( 1.44) **
Patch 4
case load baseline(std%) patch4%( std%)
process-pipe-15 1-groups 1.00 ( 14.03) -2.68 ( 9.64)
process-pipe-15 2-groups 1.00 ( 3.46) +1.82 ( 7.55)
process-pipe-15 4-groups 1.00 ( 6.20) +3.67 ( 8.17)
process-pipe-15 8-groups 1.00 ( 2.41) +1.87 ( 0.92)
process-pipe-30 1-groups 1.00 ( 2.51) -3.34 ( 3.96)
process-pipe-30 2-groups 1.00 ( 3.86) -0.33 ( 3.53)
process-pipe-30 4-groups 1.00 ( 2.19) -3.22 ( 1.31)
process-pipe-30 8-groups 1.00 ( 1.69) -1.95 ( 1.07)
process-pipe-45 1-groups 1.00 ( 1.63) +0.63 ( 2.86)
process-pipe-45 2-groups 1.00 ( 0.79) -1.27 ( 1.39)
process-pipe-45 4-groups 1.00 ( 1.62) -2.04 ( 1.87)
process-pipe-45 8-groups 1.00 ( 1.66) -1.45 ( 3.20)
process-sockets-15 1-groups 1.00 ( 3.57) -9.16 ( 5.33)
process-sockets-15 2-groups 1.00 ( 3.59) -1.83 ( 5.36)
process-sockets-15 4-groups 1.00 ( 7.10) +7.55 ( 6.34)
process-sockets-15 8-groups 1.00 ( 2.63) -2.98 ( 5.95)
process-sockets-30 1-groups 1.00 ( 3.73) +3.50 ( 4.92)
process-sockets-30 2-groups 1.00 ( 3.90) +1.80 ( 5.68)
process-sockets-30 4-groups 1.00 ( 1.03) -1.23 ( 4.79)
process-sockets-30 8-groups 1.00 ( 0.48) -0.15 ( 0.33)
process-sockets-45 1-groups 1.00 ( 0.62) -0.70 ( 1.12)
process-sockets-45 2-groups 1.00 ( 2.56) +0.64 ( 0.86)
process-sockets-45 4-groups 1.00 ( 0.57) +0.09 ( 0.53)
process-sockets-45 8-groups 1.00 ( 0.18) -7.31 ( 2.11)
threads-pipe-15 1-groups 1.00 ( 5.30) +4.94 ( 9.52)
threads-pipe-15 2-groups 1.00 ( 7.97) -4.28 ( 2.30)
threads-pipe-15 4-groups 1.00 ( 4.94) -1.83 ( 4.24)
threads-pipe-15 8-groups 1.00 ( 1.69) -2.35 ( 1.50)
threads-pipe-30 1-groups 1.00 ( 5.12) +2.06 ( 5.00)
threads-pipe-30 2-groups 1.00 ( 1.63) +0.93 ( 4.53)
threads-pipe-30 4-groups 1.00 ( 3.41) -2.85 ( 3.20)
threads-pipe-30 8-groups 1.00 ( 2.85) -2.20 ( 2.68)
threads-pipe-45 1-groups 1.00 ( 5.13) -0.97 ( 4.70)
threads-pipe-45 2-groups 1.00 ( 1.92) -2.11 ( 1.21)
threads-pipe-45 4-groups 1.00 ( 2.41) -2.69 ( 1.33)
threads-pipe-45 8-groups 1.00 ( 1.81) -2.41 ( 1.14)
threads-sockets-15 1-groups 1.00 ( 4.72) +0.82 ( 4.21)
threads-sockets-15 2-groups 1.00 ( 3.05) -1.28 ( 2.48)
threads-sockets-15 4-groups 1.00 ( 5.92) -1.75 ( 7.25)
threads-sockets-15 8-groups 1.00 ( 3.75) -2.54 ( 3.49)
threads-sockets-30 1-groups 1.00 ( 5.96) -0.46 ( 5.30)
threads-sockets-30 2-groups 1.00 ( 1.68) -0.45 ( 1.75)
threads-sockets-30 4-groups 1.00 ( 5.02) -1.48 ( 6.51)
threads-sockets-30 8-groups 1.00 ( 0.41) -13.09 ( 1.61) **
threads-sockets-45 1-groups 1.00 ( 2.55) -1.68 ( 0.66)
threads-sockets-45 2-groups 1.00 ( 3.53) +0.21 ( 2.23)
threads-sockets-45 4-groups 1.00 ( 0.51) -1.27 ( 1.43)
threads-sockets-45 8-groups 1.00 ( 3.01) -3.41 ( 0.43)
Additionally, profiling data was collected with `perf record -a` for this
workload. First, the cycle distributions are almost identical between the
baseline and patches 1-4. Second, the symbols relevant to patches 1-4 were
set_rd_overloaded/set_rd_overutilized, which are potentially invoked
(in fact inlined) by `update_sd_lb_stats`. `update_sd_lb_stats` itself
takes ~2.6% of cycles in the baseline threads-sockets-45, 8-groups
configuration, and no regression was observed in this function with
patches 1-4. So I think the patches won't cause regressions in hackbench.
Schbench(old, 91ea787)
======================
The workload was run with the same methodology as hackbench, with a
runtime of 100s. Results are below; regressions beyond -5% are marked
with ** at the end of the line. However, when the tests were re-run with
either the test framework or the vanilla workload, these regressions
could not be reproduced.
case load baseline(std%) opt1%( std%)
normal 1-mthreads-8-workers 1.00 ( 1.44) -5.60 ( 2.96) **
normal 1-mthreads-2-workers 1.00 ( 2.79) -2.65 ( 5.48)
normal 1-mthreads-1-workers 1.00 ( 1.27) -1.60 ( 1.03)
normal 1-mthreads-31-workers 1.00 ( 1.30) -0.87 ( 2.34)
normal 1-mthreads-16-workers 1.00 ( 1.74) -2.23 ( 1.15)
normal 1-mthreads-4-workers 1.00 ( 3.35) -1.92 ( 1.62)
normal 2-mthreads-8-workers 1.00 ( 2.17) -2.09 ( 1.38)
normal 2-mthreads-31-workers 1.00 ( 1.83) +1.93 ( 1.84)
normal 2-mthreads-16-workers 1.00 ( 2.06) +0.36 ( 2.38)
normal 2-mthreads-1-workers 1.00 ( 3.86) +0.50 ( 2.46)
normal 2-mthreads-2-workers 1.00 ( 1.76) -6.91 ( 2.55)
normal 2-mthreads-4-workers 1.00 ( 1.59) -5.58 ( 5.99)
normal 4-mthreads-8-workers 1.00 ( 0.85) +0.59 ( 0.54)
normal 4-mthreads-31-workers 1.00 ( 15.31) +15.04 ( 12.71)
normal 4-mthreads-16-workers 1.00 ( 0.99) -2.62 ( 2.15)
normal 4-mthreads-4-workers 1.00 ( 1.42) -2.72 ( 1.70)
normal 4-mthreads-1-workers 1.00 ( 1.43) -2.84 ( 1.73)
normal 4-mthreads-2-workers 1.00 ( 1.78) -4.28 ( 2.08)
normal 8-mthreads-16-workers 1.00 ( 10.04) +7.06 ( 0.73)
normal 8-mthreads-31-workers 1.00 ( 1.94) -1.66 ( 2.28)
normal 8-mthreads-2-workers 1.00 ( 2.51) -0.30 ( 1.53)
normal 8-mthreads-8-workers 1.00 ( 1.56) -1.83 ( 1.39)
normal 8-mthreads-1-workers 1.00 ( 4.08) +0.45 ( 1.45)
normal 8-mthreads-4-workers 1.00 ( 1.84) +2.85 ( 1.07)
case load baseline(std%) opt2%( std%)
normal 1-mthreads-8-workers 1.00 ( 1.44) -1.48 ( 3.79)
normal 1-mthreads-2-workers 1.00 ( 2.79) +3.32 ( 0.90)
normal 1-mthreads-1-workers 1.00 ( 1.27) +1.98 ( 1.02)
normal 1-mthreads-31-workers 1.00 ( 1.30) +5.84 ( 3.01)
normal 1-mthreads-16-workers 1.00 ( 1.74) +5.90 ( 0.68)
normal 1-mthreads-4-workers 1.00 ( 3.35) +1.82 ( 1.65)
normal 2-mthreads-8-workers 1.00 ( 2.17) +2.80 ( 2.04)
normal 2-mthreads-31-workers 1.00 ( 1.83) -0.07 ( 1.09)
normal 2-mthreads-16-workers 1.00 ( 2.06) +2.45 ( 2.55)
normal 2-mthreads-1-workers 1.00 ( 3.86) +2.41 ( 2.92)
normal 2-mthreads-2-workers 1.00 ( 1.76) -1.29 ( 2.03)
normal 2-mthreads-4-workers 1.00 ( 1.59) +0.44 ( 1.15)
normal 4-mthreads-8-workers 1.00 ( 0.85) -0.81 ( 3.03)
normal 4-mthreads-31-workers 1.00 ( 15.31) +2.06 ( 15.97)
normal 4-mthreads-16-workers 1.00 ( 0.99) -1.46 ( 2.29)
normal 4-mthreads-4-workers 1.00 ( 1.42) -0.15 ( 3.37)
normal 4-mthreads-1-workers 1.00 ( 1.43) +0.97 ( 1.95)
normal 4-mthreads-2-workers 1.00 ( 1.78) -0.38 ( 2.53)
normal 8-mthreads-16-workers 1.00 ( 10.04) +5.80 ( 1.72)
normal 8-mthreads-31-workers 1.00 ( 1.94) -0.76 ( 2.33)
normal 8-mthreads-2-workers 1.00 ( 2.51) +2.47 ( 2.17)
normal 8-mthreads-8-workers 1.00 ( 1.56) -0.66 ( 1.47)
normal 8-mthreads-1-workers 1.00 ( 4.08) +2.71 ( 2.78)
normal 8-mthreads-4-workers 1.00 ( 1.84) +2.35 ( 4.88)
case load baseline(std%) opt3%( std%)
normal 1-mthreads-8-workers 1.00 ( 1.44) -6.90 ( 3.85) **
normal 1-mthreads-2-workers 1.00 ( 2.79) +3.23 ( 3.09)
normal 1-mthreads-1-workers 1.00 ( 1.27) -1.04 ( 2.22)
normal 1-mthreads-31-workers 1.00 ( 1.30) +2.16 ( 1.64)
normal 1-mthreads-16-workers 1.00 ( 1.74) -0.72 ( 5.70)
normal 1-mthreads-4-workers 1.00 ( 3.35) -1.92 ( 4.31)
normal 2-mthreads-8-workers 1.00 ( 2.17) +0.82 ( 1.90)
normal 2-mthreads-31-workers 1.00 ( 1.83) +2.08 ( 1.16)
normal 2-mthreads-16-workers 1.00 ( 2.06) +4.04 ( 2.42)
normal 2-mthreads-1-workers 1.00 ( 3.86) +2.57 ( 3.44)
normal 2-mthreads-2-workers 1.00 ( 1.76) -0.12 ( 1.29)
normal 2-mthreads-4-workers 1.00 ( 1.59) -2.04 ( 2.83)
normal 4-mthreads-8-workers 1.00 ( 0.85) +0.22 ( 1.65)
normal 4-mthreads-31-workers 1.00 ( 15.31) +15.09 ( 9.83)
normal 4-mthreads-16-workers 1.00 ( 0.99) +1.46 ( 1.88)
normal 4-mthreads-4-workers 1.00 ( 1.42) +2.34 ( 1.57)
normal 4-mthreads-1-workers 1.00 ( 1.43) -0.77 ( 2.45)
normal 4-mthreads-2-workers 1.00 ( 1.78) -1.16 ( 1.85)
normal 8-mthreads-16-workers 1.00 ( 10.04) +7.39 ( 1.65)
normal 8-mthreads-31-workers 1.00 ( 1.94) -0.81 ( 2.14)
normal 8-mthreads-2-workers 1.00 ( 2.51) -1.93 ( 2.00)
normal 8-mthreads-8-workers 1.00 ( 1.56) +1.17 ( 1.40)
normal 8-mthreads-1-workers 1.00 ( 4.08) +1.63 ( 0.51)
normal 8-mthreads-4-workers 1.00 ( 1.84) +4.77 ( 2.36)
case load baseline(std%) opt4%( std%)
normal 1-mthreads-8-workers 1.00 ( 1.44) -0.27 ( 3.05)
normal 1-mthreads-2-workers 1.00 ( 2.79) -0.31 ( 1.19)
normal 1-mthreads-1-workers 1.00 ( 1.27) +1.62 ( 1.77)
normal 1-mthreads-31-workers 1.00 ( 1.30) +1.30 ( 3.34)
normal 1-mthreads-16-workers 1.00 ( 1.74) +0.07 ( 3.38)
normal 1-mthreads-4-workers 1.00 ( 3.35) +1.08 ( 2.48)
normal 2-mthreads-8-workers 1.00 ( 2.17) +0.04 ( 3.87)
normal 2-mthreads-31-workers 1.00 ( 1.83) +1.29 ( 1.44)
normal 2-mthreads-16-workers 1.00 ( 2.06) +0.94 ( 2.96)
normal 2-mthreads-1-workers 1.00 ( 3.86) +2.85 ( 2.12)
normal 2-mthreads-2-workers 1.00 ( 1.76) -0.30 ( 2.37)
normal 2-mthreads-4-workers 1.00 ( 1.59) +2.22 ( 1.51)
normal 4-mthreads-8-workers 1.00 ( 0.85) +2.20 ( 3.06)
normal 4-mthreads-31-workers 1.00 ( 15.31) +15.65 ( 12.68)
normal 4-mthreads-16-workers 1.00 ( 0.99) -1.96 ( 3.30)
normal 4-mthreads-4-workers 1.00 ( 1.42) -1.19 ( 3.42)
normal 4-mthreads-1-workers 1.00 ( 1.43) +2.26 ( 2.45)
normal 4-mthreads-2-workers 1.00 ( 1.78) -1.36 ( 2.75)
normal 8-mthreads-16-workers 1.00 ( 10.04) -0.33 ( 11.13)
normal 8-mthreads-31-workers 1.00 ( 1.94) -1.14 ( 2.01)
normal 8-mthreads-2-workers 1.00 ( 2.51) +2.32 ( 2.26)
normal 8-mthreads-8-workers 1.00 ( 1.56) -0.44 ( 1.54)
normal 8-mthreads-1-workers 1.00 ( 4.08) +2.17 ( 2.10)
normal 8-mthreads-4-workers 1.00 ( 1.84) +3.42 ( 2.34)
Again, per the perf record data, the cycle distributions are almost the
same between the baseline and patches 1-4. The symbols related to
patches 1-4 are set_rd_overloaded/set_rd_overutilized, inlined into
`update_sd_lb_stats`, which accounts for ~0.47% (self) of cycles in the
baseline 1-message-thread, 8-workers configuration; no regression was
observed in this function with patches 1-4. So I think the patches won't
cause regressions in schbench (old).
Schbench(new, 48aed1d)
======================
The workload was executed using the test framework available at
https://github.com/gormanm/mmtests. Each configuration was run for
5 iterations, with a runtime of 100 seconds per iteration. No
significant regressions were observed, as detailed below:
Notes:
1. The message thread count is always 6, matching the number of NUMA nodes.
2. 1/2/4/8/16/32/64/79 is the number of workers per message thread.
baseline patch1
Amean request-99.0th-qrtle-1 1.00 0.00%
Amean rps-50.0th-qrtle-1 1.00 0.06%
Amean wakeup-99.0th-qrtle-1 1.00 0.26%
Amean request-99.0th-qrtle-2 1.00 0.23%
Amean rps-50.0th-qrtle-2 1.00 0.00%
Amean wakeup-99.0th-qrtle-2 1.00 1.09%
Amean request-99.0th-qrtle-4 1.00 -1.32%
Amean rps-50.0th-qrtle-4 1.00 0.11%
Amean wakeup-99.0th-qrtle-4 1.00 -0.41%
Amean request-99.0th-qrtle-8 1.00 -0.08%
Amean rps-50.0th-qrtle-8 1.00 -0.17%
Amean wakeup-99.0th-qrtle-8 1.00 0.37%
Amean request-99.0th-qrtle-16 1.00 0.23%
Amean rps-50.0th-qrtle-16 1.00 -0.06%
Amean wakeup-99.0th-qrtle-16 1.00 1.03%
Amean request-99.0th-qrtle-32 1.00 0.27%
Amean rps-50.0th-qrtle-32 1.00 0.06%
Amean wakeup-99.0th-qrtle-32 1.00 -0.37%
Amean request-99.0th-qrtle-64 1.00 0.57%
Amean rps-50.0th-qrtle-64 1.00 -0.28%
Amean wakeup-99.0th-qrtle-64 1.00 -3.00%
Amean request-99.0th-qrtle-79 1.00 0.21%
Amean rps-50.0th-qrtle-79 1.00 -0.23%
Amean wakeup-99.0th-qrtle-79 1.00 2.00%
baseline patch2
Amean request-99.0th-qrtle-1 1.00 -0.46%
Amean rps-50.0th-qrtle-1 1.00 0.11%
Amean wakeup-99.0th-qrtle-1 1.00 -2.01%
Amean request-99.0th-qrtle-2 1.00 -0.08%
Amean rps-50.0th-qrtle-2 1.00 0.00%
Amean wakeup-99.0th-qrtle-2 1.00 -1.42%
Amean request-99.0th-qrtle-4 1.00 -1.16%
Amean rps-50.0th-qrtle-4 1.00 0.11%
Amean wakeup-99.0th-qrtle-4 1.00 -1.30%
Amean request-99.0th-qrtle-8 1.00 -0.08%
Amean rps-50.0th-qrtle-8 1.00 -0.40%
Amean wakeup-99.0th-qrtle-8 1.00 1.25%
Amean request-99.0th-qrtle-16 1.00 0.46%
Amean rps-50.0th-qrtle-16 1.00 -0.06%
Amean wakeup-99.0th-qrtle-16 1.00 2.52%
Amean request-99.0th-qrtle-32 1.00 14.83%
Amean rps-50.0th-qrtle-32 1.00 0.75%
Amean wakeup-99.0th-qrtle-32 1.00 3.03%
Amean request-99.0th-qrtle-64 1.00 -0.44%
Amean rps-50.0th-qrtle-64 1.00 0.28%
Amean wakeup-99.0th-qrtle-64 1.00 -3.50%
Amean request-99.0th-qrtle-79 1.00 -0.09%
Amean rps-50.0th-qrtle-79 1.00 0.08%
Amean wakeup-99.0th-qrtle-79 1.00 -1.20%
baseline patch3
Amean request-99.0th-qrtle-1 1.00 0.31%
Amean rps-50.0th-qrtle-1 1.00 -0.17%
Amean wakeup-99.0th-qrtle-1 1.00 0.44%
Amean request-99.0th-qrtle-2 1.00 -0.61%
Amean rps-50.0th-qrtle-2 1.00 -0.29%
Amean wakeup-99.0th-qrtle-2 1.00 1.93%
Amean request-99.0th-qrtle-4 1.00 -1.62%
Amean rps-50.0th-qrtle-4 1.00 -0.17%
Amean wakeup-99.0th-qrtle-4 1.00 0.00%
Amean request-99.0th-qrtle-8 1.00 0.00%
Amean rps-50.0th-qrtle-8 1.00 -0.40%
Amean wakeup-99.0th-qrtle-8 1.00 -0.29%
Amean request-99.0th-qrtle-16 1.00 0.53%
Amean rps-50.0th-qrtle-16 1.00 -0.17%
Amean wakeup-99.0th-qrtle-16 1.00 -1.03%
Amean request-99.0th-qrtle-32 1.00 0.09%
Amean rps-50.0th-qrtle-32 1.00 -0.17%
Amean wakeup-99.0th-qrtle-32 1.00 2.41%
Amean request-99.0th-qrtle-64 1.00 0.26%
Amean rps-50.0th-qrtle-64 1.00 -0.16%
Amean wakeup-99.0th-qrtle-64 1.00 -2.00%
Amean request-99.0th-qrtle-79 1.00 0.26%
Amean rps-50.0th-qrtle-79 1.00 -0.46%
Amean wakeup-99.0th-qrtle-79 1.00 1.20%
baseline patch4
Amean request-99.0th-qrtle-1 1.00 -0.15%
Amean rps-50.0th-qrtle-1 1.00 -0.06%
Amean wakeup-99.0th-qrtle-1 1.00 -2.88%
Amean request-99.0th-qrtle-2 1.00 -0.31%
Amean rps-50.0th-qrtle-2 1.00 -0.29%
Amean wakeup-99.0th-qrtle-2 1.00 -0.59%
Amean request-99.0th-qrtle-4 1.00 -0.23%
Amean rps-50.0th-qrtle-4 1.00 -0.11%
Amean wakeup-99.0th-qrtle-4 1.00 -0.41%
Amean request-99.0th-qrtle-8 1.00 -0.08%
Amean rps-50.0th-qrtle-8 1.00 -0.52%
Amean wakeup-99.0th-qrtle-8 1.00 1.91%
Amean request-99.0th-qrtle-16 1.00 0.76%
Amean rps-50.0th-qrtle-16 1.00 0.06%
Amean wakeup-99.0th-qrtle-16 1.00 1.03%
Amean request-99.0th-qrtle-32 1.00 8.36%
Amean rps-50.0th-qrtle-32 1.00 0.00%
Amean wakeup-99.0th-qrtle-32 1.00 -1.05%
Amean request-99.0th-qrtle-64 1.00 0.13%
Amean rps-50.0th-qrtle-64 1.00 0.00%
Amean wakeup-99.0th-qrtle-64 1.00 -4.00%
Amean request-99.0th-qrtle-79 1.00 -0.39%
Amean rps-50.0th-qrtle-79 1.00 0.14%
Amean wakeup-99.0th-qrtle-79 1.00 -0.40%
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
@ 2025-09-01 5:10 ` Chen, Yu C
2025-09-01 13:24 ` Deng, Pan
0 siblings, 1 reply; 16+ messages in thread
From: Chen, Yu C @ 2025-09-01 5:10 UTC (permalink / raw)
To: Pan Deng; +Cc: linux-kernel, tianyou.li, tim.c.chen, peterz, mingo, Chen Yu
On 7/7/2025 10:35 AM, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on an HCC system, significant
> cache line contention is observed around `cpupri_vec->count` and `mask` in
> struct root_domain.
>
[it seems that my last reply did not make it to the lkml][snip]
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..245b0fa626be 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -9,7 +9,7 @@
>
> struct cpupri_vec {
> atomic_t count;
> - cpumask_var_t mask;
> + cpumask_var_t mask ____cacheline_aligned;
Just curious, since this is to avoid cache contention among CPUs,
is it better to use ____cacheline_aligned_in_smp, so the single
CPU system is not impacted.
thanks,
Chenyu

> };
>
> struct cpupri {
* RE: [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
2025-09-01 5:10 ` Chen, Yu C
@ 2025-09-01 13:24 ` Deng, Pan
0 siblings, 0 replies; 16+ messages in thread
From: Deng, Pan @ 2025-09-01 13:24 UTC (permalink / raw)
To: Chen, Yu C
Cc: linux-kernel@vger.kernel.org, Li, Tianyou,
tim.c.chen@linux.intel.com, peterz@infradead.org,
mingo@kernel.org, Chen Yu
Thanks Yu, will update the patch.
Best Regards
Pan
> -----Original Message-----
> From: Chen, Yu C <yu.c.chen@intel.com>
> Sent: Monday, September 1, 2025 1:10 PM
> To: Deng, Pan <pan.deng@intel.com>
> Cc: linux-kernel@vger.kernel.org; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com; peterz@infradead.org; mingo@kernel.org; Chen Yu
> <yu.chen.surf@gmail.com>
> Subject: Re: [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache
> line contention
>
> On 7/7/2025 10:35 AM, Pan Deng wrote:
> > When running a multi-instance FFmpeg workload on an HCC system,
> significant
> > cache line contention is observed around `cpupri_vec->count` and `mask` in
> > struct root_domain.
> >
>
> [it seems that my last reply did not make it to the lkml][snip]
>
> > diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> > index d6cba0020064..245b0fa626be 100644
> > --- a/kernel/sched/cpupri.h
> > +++ b/kernel/sched/cpupri.h
> > @@ -9,7 +9,7 @@
> >
> > struct cpupri_vec {
> > atomic_t count;
> > - cpumask_var_t mask;
> > + cpumask_var_t mask ____cacheline_aligned;
>
> Just curious, since this is to avoid cache contention among CPUs,
> is it better to use ____cacheline_aligned_in_smp, so the single
> CPU system is not impacted.
>
> thanks,
> Chenyu
> > };
> >
> > struct cpupri {
end of thread, other threads:[~2025-09-01 13:24 UTC | newest]
Thread overview: 16+ messages
2025-07-07 2:35 [PATCH 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
2025-07-07 2:35 ` [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
2025-09-01 5:10 ` Chen, Yu C
2025-09-01 13:24 ` Deng, Pan
2025-07-07 2:35 ` [PATCH 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
2025-07-07 2:35 ` [PATCH 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2025-07-07 6:53 ` kernel test robot
2025-07-07 11:36 ` Deng, Pan
2025-07-07 6:53 ` kernel test robot
2025-07-08 5:33 ` kernel test robot
2025-07-08 14:02 ` Deng, Pan
2025-07-09 8:56 ` Li, Philip
2025-07-07 2:35 ` [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
2025-07-21 11:23 ` Chen, Yu C
2025-07-22 14:46 ` Deng, Pan
2025-08-06 14:00 ` Deng, Pan