* [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention
From: Pan Deng @ 2025-07-21 6:10 UTC
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload in a cloud environment,
severe cache line contention occurs on the root_domain data
structures, which significantly degrades performance.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS (frames per second) is used as the score.
Profiling shows the kernel consumes ~20% of CPU cycles, which is
excessive in this scenario. The overhead primarily comes from RT task
scheduling functions like `cpupri_set`, `cpupri_find_fitness`,
`dequeue_pushable_task`, `enqueue_pushable_task`, `pull_rt_task`,
`__find_first_and_bit`, and `__bitmap_and`. This is due to read/write
contention on root_domain cache lines.
The `perf c2c` report, sorted by contention severity, reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` is heavily loaded/stored, since
counts[0] is updated more frequently than the others: it changes
whenever an RT task enqueues onto an empty runqueue or dequeues from
a non-overloaded runqueue.
- `rto_mask` is heavily loaded
- `rto_loop_next` and `rto_loop_start` are frequently stored
- `rto_push_work` and `rto_lock` are lightly accessed
- cycles per load: ~10K to 59K.
root_domain cache line 1:
- `rto_count` is frequently loaded/stored
- `overloaded` is heavily loaded
- cycles per load: ~2.8K to 44K
cpumask (bitmap) cache line of cpupri_vec->mask:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K
The last cache line of cpupri:
- `cpupri_vec->count` and `mask` contend. The transcoding threads use
RT priority 99, so the contention lands on the vector at the end of
the structure.
- cycles per load: ~1.5K to 10.5K
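For context, the contended structure (as defined in
kernel/sched/cpupri.h) packs `count` and `mask` into a 16-byte
cpupri_vec, so four adjacent vectors share each 64-byte cache line:

struct cpupri_vec {
	atomic_t	count;
	cpumask_var_t	mask;
};

struct cpupri {
	struct cpupri_vec	pri_to_cpu[CPUPRI_NR_PRIORITIES];
	int			*cpu_to_pri;
};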
Based on the above, we propose 4 patches to mitigate the contention;
each patch resolves part of the issues:
Patch 1: Reorganize `cpupri_vec` so that the `count` and `mask` fields
are separated, reducing contention on root_domain cache line 3
and cpupri's last cache line. This patch has an alternative
implementation, described in the patch commit message; comments
are welcome.
Patch 2: Restructure `root_domain` to minimize contention on
root_domain cache lines 1 and 3 by reordering fields.
Patch 3: Split `root_domain->rto_count` to per-NUMA-node counters,
reducing the contention on root_domain cache line 1.
Patch 4: Split `cpupri_vec->cpumask` to per-NUMA-node bitmaps, reducing
load/store contention on the cpumask bitmap cache line.
Evaluation:
The patches are tested non-cumulatively; I'm happy to provide
additional data as needed.
FFmpeg benchmark:
Performance changes (FPS):
- Baseline: 100.0%
- Baseline + Patch 1: 111.0%
- Baseline + Patch 2: 105.0%
- Baseline + Patch 3: 104.0%
- Baseline + Patch 4: 103.8%
Kernel CPU cycle usage (lower is better):
- Baseline: 20.0%
- Baseline + Patch 1: 11.0%
- Baseline + Patch 2: 17.7%
- Baseline + Patch 3: 18.6%
- Baseline + Patch 4: 18.7%
Cycles per load reduction (by perf c2c report):
- Patch 1:
- `root_domain` cache line 3: 10K–59K -> 0.5K–8K
- `cpupri` last cache line: 1.5K–10.5K -> eliminated
- Patch 2:
- `root_domain` cache line 1: 2.8K–44K -> 2.1K–2.7K
- `root_domain` cache line 3: 10K–59K -> eliminated
- Patch 3:
- `root_domain` cache line 1: 2.8K–44K -> eliminated
- Patch 4:
- `cpupri_vec->mask` cache line: 2.2K–8.7K -> 0.5K–2.2K
stress-ng rt cyclic benchmark:
Command:
stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
--timeout 30 --minimize --metrics
Performance changes (bogo ops/s, real time):
- Baseline: 100.0%
- Baseline + Patch 1: 131.4%
- Baseline + Patch 2: 118.6%
- Baseline + Patch 3: 150.4%
- Baseline + Patch 4: 105.9%
rt-tests pi_stress benchmark:
Command:
rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
Performance changes (Total inversions performed):
- Baseline: 100.0%
- Baseline + Patch 1: 176.5%
- Baseline + Patch 2: 104.7%
- Baseline + Patch 3: 105.1%
- Baseline + Patch 4: 109.3%
Changes since v1:
- Patch 3: Fixed a non-CONFIG_SMP build issue.
- Patches 1-4: Added stress-ng/cyclic and rt-tests/pi_stress test
results.
Comments are appreciated; I look forward to your feedback and will
make revisions accordingly. Thanks a lot!
Pan Deng (4):
sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
sched/rt: Restructure root_domain to reduce cacheline contention
sched/rt: Split root_domain->rto_count to per-NUMA-node counters
sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce
contention
kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++----
kernel/sched/cpupri.h | 6 +-
kernel/sched/rt.c | 56 ++++++++++-
kernel/sched/sched.h | 61 ++++++------
kernel/sched/topology.c | 7 ++
5 files changed, 282 insertions(+), 48 deletions(-)
--
2.43.5
* [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
From: Pan Deng @ 2025-07-21 6:10 UTC
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC (high core
count) system, significant cache line contention is observed around
`cpupri_vec->count` and `mask` in struct root_domain.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
The perf c2c tool reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
and contends with other fields, since counts[0] is updated more
frequently than the others: it changes whenever an RT task enqueues
onto an empty runqueue or dequeues from a non-overloaded runqueue.
- cycles per load: ~10K to 59K
cpupri's last cache line:
- `cpupri_vec->count` and `mask` contend. The transcoding threads use
RT priority 99, so the contention lands on the vector at the end of
the structure.
- cycles per load: ~1.5K to 10.5K
This change mitigates the `cpupri_vec->count`/`mask` contention by
placing each vector's count and mask on separate cache lines.
As a result:
- FPS improves by ~11%
- Kernel cycles% drops from ~20% to ~11%
- `count`/`mask` cache line contention is mitigated: perf c2c shows
`cycles per load` of root_domain cache line 3 dropping from ~10K-59K
to ~0.5K-8K, and cpupri's last cache line no longer appears in the
report.
- stress-ng cyclic benchmark improves by ~31.4%; command:
stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
--timeout 30 --minimize --metrics
- rt-tests/pi_stress improves by ~76.5%; command:
rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
Appendix:
1. Current layout of contended data structure:
struct root_domain {
...
struct irq_work rto_push_work; /* 120 32 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
raw_spinlock_t rto_lock; /* 152 4 */
int rto_loop; /* 156 4 */
int rto_cpu; /* 160 4 */
atomic_t rto_loop_next; /* 164 4 */
atomic_t rto_loop_start; /* 168 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t rto_mask; /* 176 8 */
/* --- cacheline 3 boundary (192 bytes) was 8 bytes hence --- */
struct cpupri cpupri; /* 184 1624 */
/* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
struct perf_domain * pd; /* 1808 8 */
/* size: 1816, cachelines: 29, members: 21 */
/* sum members: 1802, holes: 3, sum holes: 14 */
/* forced alignments: 1 */
/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));
2. Perf c2c report of root_domain cache line 3:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
353 44 62 0xff14d42c400e3880
------- ------- ------ ------ ------ ------ ------------------------
0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_
0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_
0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on
0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single
0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on
0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl
0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl
0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl
0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock
1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task
0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task
0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task
0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task
0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task
18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task
17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task
1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task
0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task
34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness
13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set
3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set
1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness
1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set
1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set
3. Perf c2c report of cpupri's last cache line
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
149 43 41 0xff14d42c400e3ec0
------- ------- ------ ------ ------ ------ ------------------------
8.72% 11.63% 0.00% 0x8 2001 165 cpupri_find_fitness
1.34% 2.33% 0.00% 0x18 1456 151 cpupri_find_fitness
8.72% 9.30% 58.54% 0x28 1744 263 cpupri_set
2.01% 4.65% 41.46% 0x28 1958 301 cpupri_set
1.34% 0.00% 0.00% 0x28 10580 6 cpupri_set
69.80% 67.44% 0.00% 0x30 1754 347 cpupri_set
8.05% 4.65% 0.00% 0x30 2144 256 cpupri_set
Signed-off-by: Pan Deng <pan.deng@intel.com>
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Note: The side effect of this change is that the size of struct cpupri
increases from 26 cache lines to 203 cache lines.
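For reference, the arithmetic (assuming 64-byte cache lines and
CPUPRI_NR_PRIORITIES == 101, which matches the 1624-byte cpupri in the
layout above):

before: 101 vecs * 16 B  + 8 B (cpu_to_pri) = 1624 B  -> 26 cache lines
after:  101 vecs * 128 B + 8 B (cpu_to_pri) = 12936 B -> 203 cache lines

since aligning `mask` pads every 16-byte cpupri_vec out to two full
cache lines.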
An alternative implementation of this patch could separate the counts
and masks into two arrays (counts[] and masks[]) and add two paddings:
1. Between counts[0] and counts[1], since counts[0] is updated more
frequently than the others.
2. Between the two arrays, since counts[] is accessed read-write while
masks[], which holds pointers, is read-mostly.
The alternative introduces 31+/21- LoC of churn and achieves almost
the same performance, while struct cpupri shrinks from 26 cache lines
to 21 cache lines.
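A minimal sketch of that alternative layout (hypothetical, field names
illustrative; not the code in this patch):

struct cpupri {
	/* counts[0] is the hottest writer target: keep it alone */
	atomic_t	count0 ____cacheline_aligned;
	/* padding 1: counts[1..] start on a fresh cache line */
	atomic_t	counts[CPUPRI_NR_PRIORITIES - 1] ____cacheline_aligned;
	/* padding 2: read-mostly mask pointers away from the counts */
	cpumask_var_t	masks[CPUPRI_NR_PRIORITIES] ____cacheline_aligned;
	int		*cpu_to_pri;
};

With CPUPRI_NR_PRIORITIES == 101 this comes to 1344 bytes after tail
padding, i.e. the 21 cache lines quoted above.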
---
kernel/sched/cpupri.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
struct cpupri_vec {
atomic_t count;
- cpumask_var_t mask;
+ cpumask_var_t mask ____cacheline_aligned;
};
struct cpupri {
--
2.43.5
* [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention
From: Pan Deng @ 2025-07-21 6:10 UTC
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on root_domain cache lines 1 and 3.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
The perf c2c tool reveals (sorted by contention severity):
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored,
since counts[0] is updated more frequently than the others: it changes
whenever an RT task enqueues onto an empty runqueue or dequeues from a
non-overloaded runqueue.
- `rto_mask` (0x30) is heavily loaded
- `rto_loop_next` (0x24) and `rto_loop_start` (0x28) are frequently stored
- `rto_push_work` (0x0) and `rto_lock` (0x18) are lightly accessed
- cycles per load: ~10K to 59K
root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K
This change adjusts the layout of `root_domain` to isolate these contended
fields across separate cache lines:
1. `rto_count` remains in the 1st cache line; `overloaded` and
`overutilized` are moved to the last cache line
2. `rto_push_work` is placed in the 2nd cache line
3. `rto_loop_start`, `rto_loop_next`, and `rto_lock` remain in the 3rd
cache line; `rto_mask` is moved near `pd` in the penultimate cache line
4. `cpupri` starts at the 4th cache line to keep `pri_to_cpu[0].count`
from contending with the fields in cache line 3.
With this change:
- FPS improves by ~5%
- Kernel cycles% drops from ~20% to ~17.7%
- root_domain cache line 3 no longer appears in the perf c2c report
- cycles per load of root_domain cache line 1 is reduced from
~2.8K-44K to ~2.1K-2.7K
- stress-ng cyclic benchmark improves by ~18.6%; command:
stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
--timeout 30 --minimize --metrics
- rt-tests/pi_stress improves by ~4.7%; command:
rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
Given the nature of the change, to my understanding it doesn't
introduce any negative impact in other scenarios.
Note: This change increases the size of `root_domain` from 29 to 31
cache lines; this is considered acceptable since `root_domain` is a
single global object.
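The isolation relies on the ____cacheline_aligned annotation, which
forces the annotated member to start on a fresh cache line (it expands
to an aligned attribute with the cache line size). A standalone
user-space sketch of the effect, assuming 64-byte lines:

#include <stdio.h>
#include <stddef.h>

#define CACHELINE_ALIGNED __attribute__((__aligned__(64)))

struct demo {
	int hot_rw;				/* stored by many CPUs */
	int read_mostly CACHELINE_ALIGNED;	/* starts a new line, so
						 * stores to hot_rw no
						 * longer invalidate it */
};

int main(void)
{
	/* prints 64: read_mostly landed on its own cache line */
	printf("offsetof(read_mostly) = %zu\n",
	       offsetof(struct demo, read_mostly));
	return 0;
}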
Appendix:
1. Current layout of contended data structure:
struct root_domain {
atomic_t refcount; /* 0 4 */
atomic_t rto_count; /* 4 4 */
struct callback_head rcu __attribute__((__aligned__(8))); /* 8 16 */
cpumask_var_t span; /* 24 8 */
cpumask_var_t online; /* 32 8 */
bool overloaded; /* 40 1 */
bool overutilized; /* 41 1 */
/* XXX 6 bytes hole, try to pack */
cpumask_var_t dlo_mask; /* 48 8 */
atomic_t dlo_count; /* 56 4 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct dl_bw dl_bw; /* 64 24 */
struct cpudl cpudl; /* 88 24 */
u64 visit_gen; /* 112 8 */
struct irq_work rto_push_work; /* 120 32 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
raw_spinlock_t rto_lock; /* 152 4 */
int rto_loop; /* 156 4 */
int rto_cpu; /* 160 4 */
atomic_t rto_loop_next; /* 164 4 */
atomic_t rto_loop_start; /* 168 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t rto_mask; /* 176 8 */
/* --- cacheline 3 boundary (192 bytes) was 8 bytes hence --- */
struct cpupri cpupri; /* 184 1624 */
...
} __attribute__((__aligned__(8)));
2. Perf c2c report of root_domain cache line 3:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
353 44 62 0xff14d42c400e3880
------- ------- ------ ------ ------ ------ ------------------------
0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_
0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_
0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on
0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single
0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on
0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl
0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl
0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl
0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock
1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task
0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task
0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task
0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task
0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task
18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task
17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task
1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task
0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task
34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness
13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set
3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set
1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness
1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set
1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set
3. Perf c2c report of root_domain cache line 1:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
231 43 48 0xff14d42c400e3800
------- ------- ------ ------ ------ ------ ------------------------
22.51% 18.60% 0.00% 0x4 5041 247 pull_rt_task
5.63% 2.33% 45.83% 0x4 6995 315 dequeue_pushable_task
3.90% 4.65% 54.17% 0x4 6587 370 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 17111 4 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 44062 4 dequeue_pushable_task
32.03% 27.91% 0.00% 0x28 6393 285 enqueue_task_rt
16.45% 27.91% 0.00% 0x28 5534 139 sched_balance_newidle
14.72% 18.60% 0.00% 0x28 5287 110 dequeue_task_rt
3.46% 0.00% 0.00% 0x28 2820 25 enqueue_task_fair
0.43% 0.00% 0.00% 0x28 220 3 enqueue_task_stop
Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/sched.h | 52 +++++++++++++++++++++++---------------------
1 file changed, 27 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 83e3aa917142..bc67806911f2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -968,24 +968,29 @@ struct root_domain {
cpumask_var_t span;
cpumask_var_t online;
+ atomic_t dlo_count;
+ struct dl_bw dl_bw;
+ struct cpudl cpudl;
+
+#ifdef HAVE_RT_PUSH_IPI
/*
- * Indicate pullable load on at least one CPU, e.g:
- * - More than one runnable task
- * - Running task is misfit
+ * For IPI pull requests, loop across the rto_mask.
*/
- bool overloaded;
-
- /* Indicate one or more CPUs over-utilized (tipping point) */
- bool overutilized;
+ struct irq_work rto_push_work;
+ raw_spinlock_t rto_lock;
+ /* These are only updated and read within rto_lock */
+ int rto_loop;
+ int rto_cpu;
+ /* These atomics are updated outside of a lock */
+ atomic_t rto_loop_next;
+ atomic_t rto_loop_start;
+#endif
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
*/
cpumask_var_t dlo_mask;
- atomic_t dlo_count;
- struct dl_bw dl_bw;
- struct cpudl cpudl;
/*
* Indicate whether a root_domain's dl_bw has been checked or
@@ -995,32 +1000,29 @@ struct root_domain {
* that u64 is 'big enough'. So that shouldn't be a concern.
*/
u64 visit_cookie;
+ struct cpupri cpupri ____cacheline_aligned;
-#ifdef HAVE_RT_PUSH_IPI
/*
- * For IPI pull requests, loop across the rto_mask.
+ * NULL-terminated list of performance domains intersecting with the
+ * CPUs of the rd. Protected by RCU.
*/
- struct irq_work rto_push_work;
- raw_spinlock_t rto_lock;
- /* These are only updated and read within rto_lock */
- int rto_loop;
- int rto_cpu;
- /* These atomics are updated outside of a lock */
- atomic_t rto_loop_next;
- atomic_t rto_loop_start;
-#endif
+ struct perf_domain __rcu *pd ____cacheline_aligned;
+
/*
* The "RT overload" flag: it gets set if a CPU has more than
* one runnable RT task.
*/
cpumask_var_t rto_mask;
- struct cpupri cpupri;
/*
- * NULL-terminated list of performance domains intersecting with the
- * CPUs of the rd. Protected by RCU.
+ * Indicate pullable load on at least one CPU, e.g:
+ * - More than one runnable task
+ * - Running task is misfit
*/
- struct perf_domain __rcu *pd;
+ bool overloaded ____cacheline_aligned;
+
+ /* Indicate one or more CPUs over-utilized (tipping point) */
+ bool overutilized;
};
extern void init_defrootdomain(void);
--
2.43.5
* [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
From: Pan Deng @ 2025-07-21 6:10 UTC
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on the root_domain `rto_count` and
`overloaded` fields.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
The perf c2c tool reveals:
root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K
A separate patch rearranges root_domain to place `overloaded` on a
different cache line, but that alone is insufficient to resolve the
contention on `rto_count`. As a complement, this patch splits
`rto_count` into per-NUMA-node counters to reduce the contention.
With this change:
- FPS improves by ~4%
- Kernel cycles% drops from ~20% to ~18.6%
- The cache line no longer appears in the perf c2c report
- stress-ng cyclic benchmark improves by ~50.4%; command:
stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
--timeout 30 --minimize --metrics
- rt-tests/pi_stress improves by ~5.1%; command:
rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
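As a sketch of the split-counter pattern this patch applies (a
hypothetical user-space C11 analogue with two nodes; the real code
additionally starts its read scan at numa_node_id() so the local shard
is summed first):

#include <stdatomic.h>

#define NR_NODES 2

/* one counter per node, each on its own cache line, so writers on
 * different nodes stop bouncing a single shared line */
struct shard {
	_Atomic int count;
} __attribute__((__aligned__(64)));

static struct shard rto_shards[NR_NODES];

static void set_overload(int node)
{
	atomic_fetch_add(&rto_shards[node].count, 1);
}

static void clear_overload(int node)
{
	atomic_fetch_sub(&rto_shards[node].count, 1);
}

static int overloaded(void)
{
	int sum = 0;

	for (int n = 0; n < NR_NODES; n++) {
		sum += atomic_load(&rto_shards[n].count);
		if (sum > 1)	/* callers only distinguish 0, 1, >1 */
			return sum;
	}
	return sum;
}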
Appendix:
1. Perf c2c report of root_domain cache line 1:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
231 43 48 0xff14d42c400e3800
------- ------- ------ ------ ------ ------ ------------------------
22.51% 18.60% 0.00% 0x4 5041 247 pull_rt_task
5.63% 2.33% 45.83% 0x4 6995 315 dequeue_pushable_task
3.90% 4.65% 54.17% 0x4 6587 370 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 17111 4 enqueue_pushable_task
0.43% 0.00% 0.00% 0x4 44062 4 dequeue_pushable_task
32.03% 27.91% 0.00% 0x28 6393 285 enqueue_task_rt
16.45% 27.91% 0.00% 0x28 5534 139 sched_balance_newidle
14.72% 18.60% 0.00% 0x28 5287 110 dequeue_task_rt
3.46% 0.00% 0.00% 0x28 2820 25 enqueue_task_fair
0.43% 0.00% 0.00% 0x28 220 3 enqueue_task_stop
Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
V1 -> V2: Fixed a non-CONFIG_SMP build issue
---
kernel/sched/rt.c | 56 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 9 ++++++-
kernel/sched/topology.c | 7 ++++++
3 files changed, 68 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c37033..cbcfd3aa3439 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -337,9 +337,58 @@ static inline bool need_pull_rt_task(struct rq *rq, struct task_struct *prev)
return rq->online && rq->rt.highest_prio.curr > prev->prio;
}
+int rto_counts_init(atomic_tp **rto_counts)
+{
+ int i;
+ atomic_tp *counts = kzalloc(nr_node_ids * sizeof(atomic_tp), GFP_KERNEL);
+
+ if (!counts)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_node_ids; i++) {
+ counts[i] = kzalloc_node(sizeof(atomic_t), GFP_KERNEL, i);
+
+ if (!counts[i])
+ goto cleanup;
+ }
+
+ *rto_counts = counts;
+ return 0;
+
+cleanup:
+ while (i--)
+ kfree(counts[i]);
+
+ kfree(counts);
+ return -ENOMEM;
+}
+
+void rto_counts_cleanup(atomic_tp *rto_counts)
+{
+ for (int i = 0; i < nr_node_ids; i++)
+ kfree(rto_counts[i]);
+
+ kfree(rto_counts);
+}
+
static inline int rt_overloaded(struct rq *rq)
{
- return atomic_read(&rq->rd->rto_count);
+ int count = 0;
+ int cur_node, nid;
+
+ cur_node = numa_node_id();
+
+ for (int i = 0; i < nr_node_ids; i++) {
+ nid = (cur_node + i) % nr_node_ids;
+ count += atomic_read(rq->rd->rto_counts[nid]);
+
+ /*
+  * The caller only distinguishes 0, 1 and >1,
+  * so return as soon as the count exceeds 1.
+  */
+ if (count > 1)
+ return count;
+ }
+
+ return count;
}
static inline void rt_set_overload(struct rq *rq)
@@ -358,7 +407,7 @@ static inline void rt_set_overload(struct rq *rq)
* Matched by the barrier in pull_rt_task().
*/
smp_wmb();
- atomic_inc(&rq->rd->rto_count);
+ atomic_inc(rq->rd->rto_counts[cpu_to_node(rq->cpu)]);
}
static inline void rt_clear_overload(struct rq *rq)
@@ -367,7 +416,7 @@ static inline void rt_clear_overload(struct rq *rq)
return;
/* the order here really doesn't matter */
- atomic_dec(&rq->rd->rto_count);
+ atomic_dec(rq->rd->rto_counts[cpu_to_node(rq->cpu)]);
cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask);
}
@@ -443,6 +492,7 @@ static inline void dequeue_pushable_task(struct rq *rq, struct task_struct *p)
static inline void rt_queue_push_tasks(struct rq *rq)
{
}
+
#endif /* CONFIG_SMP */
static void enqueue_top_rt_rq(struct rt_rq *rt_rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc67806911f2..13fc3ac3381b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -953,6 +953,8 @@ struct perf_domain {
struct rcu_head rcu;
};
+typedef atomic_t *atomic_tp;
+
/*
* We add the notion of a root-domain which will be used to define per-domain
* variables. Each exclusive cpuset essentially defines an island domain by
@@ -963,12 +965,15 @@ struct perf_domain {
*/
struct root_domain {
atomic_t refcount;
- atomic_t rto_count;
struct rcu_head rcu;
cpumask_var_t span;
cpumask_var_t online;
atomic_t dlo_count;
+
+ /* rto_count per node */
+ atomic_tp *rto_counts;
+
struct dl_bw dl_bw;
struct cpudl cpudl;
@@ -1030,6 +1035,8 @@ extern int sched_init_domains(const struct cpumask *cpu_map);
extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
extern void sched_get_rd(struct root_domain *rd);
extern void sched_put_rd(struct root_domain *rd);
+extern int rto_counts_init(atomic_tp **rto_counts);
+extern void rto_counts_cleanup(atomic_tp *rto_counts);
static inline int get_rd_overloaded(struct root_domain *rd)
{
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b958fe48e020..166dc8177a44 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -457,6 +457,7 @@ static void free_rootdomain(struct rcu_head *rcu)
{
struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
+ rto_counts_cleanup(rd->rto_counts);
cpupri_cleanup(&rd->cpupri);
cpudl_cleanup(&rd->cpudl);
free_cpumask_var(rd->dlo_mask);
@@ -549,8 +550,14 @@ static int init_rootdomain(struct root_domain *rd)
if (cpupri_init(&rd->cpupri) != 0)
goto free_cpudl;
+
+ if (rto_counts_init(&rd->rto_counts) != 0)
+ goto free_cpupri;
+
return 0;
+free_cpupri:
+ cpupri_cleanup(&rd->cpupri);
free_cpudl:
cpudl_cleanup(&rd->cpudl);
free_rto_mask:
--
2.43.5
* [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
From: Pan Deng @ 2025-07-21 6:10 UTC
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on the bitmap of
`cpupri_vec->cpumask`.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
The perf c2c tool reveals:
cpumask (bitmap) cache line of `cpupri_vec->mask`:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K
This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
mitigate false sharing.
As a result:
- FPS improves by ~3.8%
- Kernel cycles% drops from ~20% to ~18.7%
- Cache line contention is mitigated: perf c2c shows cycles per load
dropping from ~2.2K-8.7K to ~0.5K-2.2K
- stress-ng cyclic benchmark improves by ~5.9%; command:
stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
--timeout 30 --minimize --metrics
- rt-tests/pi_stress improves by ~9.3%; command:
rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
Note: the CONFIG_CPUMASK_OFFSTACK=n case remains unchanged.
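The per-node encoding is worth spelling out. As I read the patch,
masks[n] carries real state only for node n's CPUs; the bits of every
other node are pre-set to 1 (the bitmap_complement() in
alloc_vec_masks() below), so ANDing all per-node masks reconstructs
the global mask while writers touch only their own node's cache lines.
A toy user-space demo of that invariant (2 nodes, 8 CPUs, a uint8_t
standing in for a cpumask):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* node 0 owns CPUs 0-3, node 1 owns CPUs 4-7 */
	uint8_t node_cpus[2] = { 0x0F, 0xF0 };
	uint8_t masks[2];

	/* init: clear own node's bits, set all others (bitmap_complement) */
	for (int i = 0; i < 2; i++)
		masks[i] = (uint8_t)~node_cpus[i];

	masks[0] |= 1 << 2;	/* cpupri_set(): CPU 2 reaches this prio */
	masks[1] |= 1 << 5;	/* cpupri_set(): CPU 5 reaches this prio */

	/* reader (__cpupri_find()) ANDs across nodes: prints 0x24 */
	printf("global mask: 0x%02x\n",
	       (unsigned)(masks[0] & masks[1]));
	return 0;
}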
Appendix:
1. Perf c2c report of `cpupri_vec->mask` bitmap cache line:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
155 39 39 0xff14d52c4682d800
------- ------- ------ ------ ------ ------ ------------------------
43.23% 43.59% 0.00% 0x0 3489 415 _find_first_and_bit
3.23% 5.13% 0.00% 0x0 3478 107 __bitmap_and
3.23% 0.00% 0.00% 0x0 2712 33 _find_first_and_bit
1.94% 0.00% 7.69% 0x0 5992 33 cpupri_set
0.00% 0.00% 5.13% 0x0 3733 19 cpupri_set
12.90% 12.82% 0.00% 0x8 3452 297 _find_first_and_bit
1.29% 2.56% 0.00% 0x8 3007 117 __bitmap_and
0.00% 5.13% 0.00% 0x8 3041 20 _find_first_and_bit
0.00% 2.56% 2.56% 0x8 2374 22 cpupri_set
0.00% 0.00% 7.69% 0x8 4194 38 cpupri_set
8.39% 2.56% 0.00% 0x10 3336 264 _find_first_and_bit
3.23% 0.00% 0.00% 0x10 3023 46 _find_first_and_bit
2.58% 0.00% 0.00% 0x10 3040 130 __bitmap_and
1.29% 0.00% 12.82% 0x10 4075 34 cpupri_set
0.00% 0.00% 2.56% 0x10 2197 19 cpupri_set
0.00% 2.56% 7.69% 0x18 4085 27 cpupri_set
0.00% 2.56% 0.00% 0x18 3128 220 _find_first_and_bit
0.00% 0.00% 5.13% 0x18 3028 20 cpupri_set
2.58% 2.56% 0.00% 0x20 3089 198 _find_first_and_bit
1.29% 0.00% 5.13% 0x20 5114 29 cpupri_set
0.65% 2.56% 0.00% 0x20 3224 96 __bitmap_and
0.65% 0.00% 7.69% 0x20 4392 31 cpupri_set
2.58% 0.00% 0.00% 0x28 3327 214 _find_first_and_bit
0.65% 2.56% 5.13% 0x28 5252 31 cpupri_set
0.65% 0.00% 7.69% 0x28 8755 25 cpupri_set
0.65% 0.00% 0.00% 0x28 4414 14 _find_first_and_bit
1.29% 2.56% 0.00% 0x30 3139 171 _find_first_and_bit
0.65% 0.00% 7.69% 0x30 2185 18 cpupri_set
0.65% 0.00% 0.00% 0x30 3404 108 __bitmap_and
0.00% 0.00% 2.56% 0x30 5542 21 cpupri_set
3.23% 5.13% 0.00% 0x38 3493 190 _find_first_and_bit
3.23% 2.56% 0.00% 0x38 3171 108 __bitmap_and
0.00% 2.56% 7.69% 0x38 3285 14 cpupri_set
0.00% 0.00% 5.13% 0x38 4035 27 cpupri_set
Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++++----
kernel/sched/cpupri.h | 4 +
2 files changed, 186 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 42c40cfdf836..306b6baff4cd 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -64,6 +64,143 @@ static int convert_prio(int prio)
return cpupri;
}
+#ifdef CONFIG_CPUMASK_OFFSTACK
+static inline int alloc_vec_masks(struct cpupri_vec *vec)
+{
+ int i;
+
+ for (i = 0; i < nr_node_ids; i++) {
+ if (!zalloc_cpumask_var_node(&vec->masks[i], GFP_KERNEL, i))
+ goto cleanup;
+
+ /* Clear this node's CPU bits; set the bits of all other nodes */
+ bitmap_complement(cpumask_bits(vec->masks[i]),
+ cpumask_bits(cpumask_of_node(i)), small_cpumask_bits);
+ }
+ return 0;
+
+cleanup:
+ while (i--)
+ free_cpumask_var(vec->masks[i]);
+ return -ENOMEM;
+}
+
+static inline void free_vec_masks(struct cpupri_vec *vec)
+{
+ for (int i = 0; i < nr_node_ids; i++)
+ free_cpumask_var(vec->masks[i]);
+}
+
+static inline int setup_vec_mask_var_ts(struct cpupri *cp)
+{
+ int i;
+
+ for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+ vec->masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL);
+ if (!vec->masks)
+ goto cleanup;
+ }
+ return 0;
+
+cleanup:
+ /* Free any already allocated masks */
+ while (i--) {
+ kfree(cp->pri_to_cpu[i].masks);
+ cp->pri_to_cpu[i].masks = NULL;
+ }
+
+ return -ENOMEM;
+}
+
+static inline void free_vec_mask_var_ts(struct cpupri *cp)
+{
+ for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ kfree(cp->pri_to_cpu[i].masks);
+ cp->pri_to_cpu[i].masks = NULL;
+ }
+}
+
+static inline int
+available_cpu_in_nodes(struct task_struct *p, struct cpupri_vec *vec)
+{
+ int cur_node = numa_node_id();
+
+ for (int i = 0; i < nr_node_ids; i++) {
+ int nid = (cur_node + i) % nr_node_ids;
+
+ if (cpumask_first_and_and(&p->cpus_mask, vec->masks[nid],
+ cpumask_of_node(nid)) < nr_cpu_ids)
+ return 1;
+ }
+
+ return 0;
+}
+
+#define available_cpu_in_vec available_cpu_in_nodes
+
+#else /* !CONFIG_CPUMASK_OFFSTACK */
+
+static inline int alloc_vec_masks(struct cpupri_vec *vec)
+{
+ if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ return 0;
+}
+
+static inline void free_vec_masks(struct cpupri_vec *vec)
+{
+ free_cpumask_var(vec->mask);
+}
+
+static inline int setup_vec_mask_var_ts(struct cpupri *cp)
+{
+ return 0;
+}
+
+static inline void free_vec_mask_var_ts(struct cpupri *cp)
+{
+}
+
+static inline int
+available_cpu_in_vec(struct task_struct *p, struct cpupri_vec *vec)
+{
+ if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+ return 0;
+
+ return 1;
+}
+#endif
+
+static inline int alloc_all_masks(struct cpupri *cp)
+{
+ int i;
+
+ for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ if (alloc_vec_masks(&cp->pri_to_cpu[i]))
+ goto cleanup;
+ }
+
+ return 0;
+
+cleanup:
+ while (i--)
+ free_vec_masks(&cp->pri_to_cpu[i]);
+
+ return -ENOMEM;
+}
+
+static inline void setup_vec_counts(struct cpupri *cp)
+{
+ for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
+ struct cpupri_vec *vec = &cp->pri_to_cpu[i];
+
+ atomic_set(&vec->count, 0);
+ }
+}
+
static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
struct cpumask *lowest_mask, int idx)
{
@@ -96,11 +233,24 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
if (skip)
return 0;
- if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+ if (!available_cpu_in_vec(p, vec))
return 0;
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ struct cpumask *cpupri_mask = lowest_mask;
+
+ /* available && lowest_mask */
+ if (lowest_mask) {
+ cpumask_copy(cpupri_mask, vec->masks[0]);
+ for (int nid = 1; nid < nr_node_ids; nid++)
+ cpumask_and(cpupri_mask, cpupri_mask, vec->masks[nid]);
+ }
+#else
+ struct cpumask *cpupri_mask = vec->mask;
+#endif
+
if (lowest_mask) {
- cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+ cpumask_and(lowest_mask, &p->cpus_mask, cpupri_mask);
cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
/*
@@ -229,7 +379,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
if (likely(newpri != CPUPRI_INVALID)) {
struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ cpumask_set_cpu(cpu, vec->masks[cpu_to_node(cpu)]);
+#else
cpumask_set_cpu(cpu, vec->mask);
+#endif
/*
* When adding a new vector, we update the mask first,
* do a write memory barrier, and then update the count, to
@@ -263,7 +417,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
*/
atomic_dec(&(vec)->count);
smp_mb__after_atomic();
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ cpumask_clear_cpu(cpu, vec->masks[cpu_to_node(cpu)]);
+#else
cpumask_clear_cpu(cpu, vec->mask);
+#endif
}
*currpri = newpri;
@@ -279,26 +437,31 @@ int cpupri_init(struct cpupri *cp)
{
int i;
- for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
- struct cpupri_vec *vec = &cp->pri_to_cpu[i];
-
- atomic_set(&vec->count, 0);
- if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
- goto cleanup;
- }
-
+ /* Allocate the cpu_to_pri array */
cp->cpu_to_pri = kcalloc(nr_cpu_ids, sizeof(int), GFP_KERNEL);
if (!cp->cpu_to_pri)
- goto cleanup;
+ return -ENOMEM;
+ /* Initialize all CPUs to invalid priority */
for_each_possible_cpu(i)
cp->cpu_to_pri[i] = CPUPRI_INVALID;
+ /* Setup priority vectors */
+ setup_vec_counts(cp);
+ if (setup_vec_mask_var_ts(cp))
+ goto fail_setup_vectors;
+
+ /* Allocate masks for each priority vector */
+ if (alloc_all_masks(cp))
+ goto fail_alloc_masks;
+
return 0;
-cleanup:
- for (i--; i >= 0; i--)
- free_cpumask_var(cp->pri_to_cpu[i].mask);
+fail_alloc_masks:
+ free_vec_mask_var_ts(cp);
+
+fail_setup_vectors:
+ kfree(cp->cpu_to_pri);
return -ENOMEM;
}
@@ -308,9 +471,10 @@ int cpupri_init(struct cpupri *cp)
*/
void cpupri_cleanup(struct cpupri *cp)
{
- int i;
-
kfree(cp->cpu_to_pri);
- for (i = 0; i < CPUPRI_NR_PRIORITIES; i++)
- free_cpumask_var(cp->pri_to_cpu[i].mask);
+
+ for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++)
+ free_vec_masks(&cp->pri_to_cpu[i]);
+
+ free_vec_mask_var_ts(cp);
}
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index 245b0fa626be..c53f1f4dad86 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,11 @@
struct cpupri_vec {
atomic_t count;
+#ifdef CONFIG_CPUMASK_OFFSTACK
+ cpumask_var_t *masks ____cacheline_aligned;
+#else
cpumask_var_t mask ____cacheline_aligned;
+#endif
};
struct cpupri {
--
2.43.5