* [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention
@ 2025-07-21 6:10 Pan Deng
2025-07-21 6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
` (4 more replies)
0 siblings, 5 replies; 41+ messages in thread
From: Pan Deng @ 2025-07-21 6:10 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng
When running a multi-instance FFmpeg workload in a cloud environment,
cache line contention during access to the root_domain data structures
is severe and significantly degrades performance.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS (frames per second) is used as the score.
Profiling shows the kernel consumes ~20% of CPU cycles, which is
excessive in this scenario. The overhead primarily comes from RT task
scheduling functions like `cpupri_set`, `cpupri_find_fitness`,
`dequeue_pushable_task`, `enqueue_pushable_task`, `pull_rt_task`,
`__find_first_and_bit`, and `__bitmap_and`. This is due to read/write
contention on root_domain cache lines.
The `perf c2c` report, sorted by contention severity, reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` is heavily loaded/stored, since
  counts[0] is updated more frequently than the others: it changes
  whenever an RT task enqueues onto an empty runqueue or dequeues
  from a non-overloaded runqueue.
- `rto_mask` is heavily loaded
- `rto_loop_next` and `rto_loop_start` are frequently stored
- `rto_push_work` and `rto_lock` are lightly accessed
- cycles per load: ~10K to 59K
root_domain cache line 1:
- `rto_count` is frequently loaded/stored
- `overloaded` is heavily loaded
- cycles per load: ~2.8K to 44K
cpumask (bitmap) cache line of cpupri_vec->mask:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K
The last cache line of cpupri:
- `cpupri_vec->count` and `mask` contend. The transcoding threads use
  RT priority 99, so the contention lands on the last cache line.
- cycles per load: ~1.5K to 10.5K
Based on the above, we propose 4 patches to mitigate the contention;
each patch resolves part of the issues:
Patch 1: Reorganize `cpupri_vec` so that the `count` and `mask` fields
         sit on separate cache lines, reducing contention on root_domain
         cache line 3 and cpupri's last cache line. This patch has an
         alternative implementation, described in the patch commit
         message; comments are welcome.
Patch 2: Restructure `root_domain`, reordering fields to minimize
         contention on root_domain cache lines 1 and 3.
Patch 3: Split `root_domain->rto_count` to per-NUMA-node counters,
reducing the contention on root_domain cache line 1.
Patch 4: Split `cpupri_vec->cpumask` to per-NUMA-node bitmaps, reducing
load/store contention on the cpumask bitmap cache line.
Evaluation:
The patches are tested non-cumulatively; I'm happy to provide additional
data as needed.
FFmpeg benchmark:
Performance changes (FPS):
- Baseline: 100.0%
- Baseline + Patch 1: 111.0%
- Baseline + Patch 2: 105.0%
- Baseline + Patch 3: 104.0%
- Baseline + Patch 4: 103.8%
Kernel CPU cycle usage (lower is better):
- Baseline: 20.0%
- Baseline + Patch 1: 11.0%
- Baseline + Patch 2: 17.7%
- Baseline + Patch 3: 18.6%
- Baseline + Patch 4: 18.7%
Cycles per load reduction (by perf c2c report):
- Patch 1:
- `root_domain` cache line 3: 10K–59K -> 0.5K–8K
- `cpupri` last cache line: 1.5K–10.5K -> eliminated
- Patch 2:
- `root_domain` cache line 1: 2.8K–44K -> 2.1K–2.7K
- `root_domain` cache line 3: 10K–59K -> eliminated
- Patch 3:
- `root_domain` cache line 1: 2.8K–44K -> eliminated
- Patch 4:
- `cpupri_vec->mask` cache line: 2.2K–8.7K -> 0.5K–2.2K
stress-ng rt cyclic benchmark:
Command:
stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
--timeout 30 --minimize --metrics
Performance changes (bogo ops/s, real time):
- Baseline: 100.0%
- Baseline + Patch 1: 131.4%
- Baseline + Patch 2: 118.6%
- Baseline + Patch 3: 150.4%
- Baseline + Patch 4: 105.9%
rt-tests pi_stress benchmark:
Command:
rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
Performance changes (Total inversions performed):
- Baseline: 100.0%
- Baseline + Patch 1: 176.5%
- Baseline + Patch 2: 104.7%
- Baseline + Patch 3: 105.1%
- Baseline + Patch 4: 109.3%
Changes since v1:
- Patch 3: Fixed a non-CONFIG_SMP build issue.
- Patch 1-4: Added stress-ng/cyclic and rt-tests/pi_stress test results.
Comments are appreciated; I look forward to hearing feedback and making
revisions, thanks a lot!
Pan Deng (4):
sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
sched/rt: Restructure root_domain to reduce cacheline contention
sched/rt: Split root_domain->rto_count to per-NUMA-node counters
sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce
contention
kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++----
kernel/sched/cpupri.h | 6 +-
kernel/sched/rt.c | 56 ++++++++++-
kernel/sched/sched.h | 61 ++++++------
kernel/sched/topology.c | 7 ++
5 files changed, 282 insertions(+), 48 deletions(-)
--
2.43.5
^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
@ 2025-07-21  6:10 ` Pan Deng
  2026-03-20 10:09   ` Peter Zijlstra
  2026-04-08 10:16   ` Chen, Yu C
  2025-07-21  6:10 ` [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
  ` (3 subsequent siblings)
  4 siblings, 2 replies; 41+ messages in thread
From: Pan Deng @ 2025-07-21 6:10 UTC (permalink / raw)
To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng

When running a multi-instance FFmpeg workload on an HCC system, significant
cache line contention is observed around `cpupri_vec->count` and `mask` in
struct root_domain.

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as score.

perf c2c tool reveals:

root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
  and contends with other fields, since counts[0] is more frequently
  updated than others along with a rt task enqueues an empty runq or
  dequeues from a non-overloaded runq.
- cycles per load: ~10K to 59K

cpupri's last cache line:
- `cpupri_vec->count` and `mask` contends. The transcoding threads use
  rt pri 99, so that the contention occurs in the end.
- cycles per load: ~1.5K to 10.5K

This change mitigates `cpupri_vec->count`, `mask` related contentions by
separating each count and mask into different cache lines.

As a result:
- FPS improves by ~11%
- Kernel cycles% drops from ~20% to ~11%
- `count` and `mask` related cache line contention is mitigated, perf c2c
  shows root_domain cache line 3 `cycles per load` drops from ~10K-59K to
  ~0.5K-8K, cpupri's last cache line no longer appears in the report.
- stress-ng cyclic benchmark is improved ~31.4%, command:
  stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
	--timeout 30 --minimize --metrics
- rt-tests/pi_stress is improved ~76.5%, command:
  rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))

Appendix:
1. Current layout of contended data structure:

struct root_domain {
	...
	struct irq_work            rto_push_work;       /*   120    32 */
	/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
	raw_spinlock_t             rto_lock;            /*   152     4 */
	int                        rto_loop;            /*   156     4 */
	int                        rto_cpu;             /*   160     4 */
	atomic_t                   rto_loop_next;       /*   164     4 */
	atomic_t                   rto_loop_start;      /*   168     4 */

	/* XXX 4 bytes hole, try to pack */

	cpumask_var_t              rto_mask;            /*   176     8 */
	/* --- cacheline 3 boundary (192 bytes) was 8 bytes hence --- */
	struct cpupri              cpupri;              /*   184  1624 */
	/* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
	struct perf_domain *       pd;                  /*  1808     8 */

	/* size: 1816, cachelines: 29, members: 21 */
	/* sum members: 1802, holes: 3, sum holes: 14 */
	/* forced alignments: 1 */
	/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

2. Perf c2c report of root_domain cache line 3:

-------  -------  -------  ------  ------  -------  --------------------------
    Rmt      Lcl    Store    Data    Load    Total
  Hitm%    Hitm%  L1 Hit%  offset  cycles  records  Symbol
-------  -------  -------  ------  ------  -------  --------------------------
    353       44       62  0xff14d42c400e3880
-------  -------  -------  ------  ------  -------  --------------------------
  0.00%    2.27%    0.00%     0x0   21683        6  __flush_smp_call_function_
  0.00%    2.27%    0.00%     0x0   22294        5  __flush_smp_call_function_
  0.28%    0.00%    0.00%     0x0       0        2  irq_work_queue_on
  0.28%    0.00%    0.00%     0x0   27824        4  irq_work_single
  0.00%    0.00%    1.61%     0x0   28151        6  irq_work_queue_on
  0.57%    0.00%    0.00%    0x18   21822        8  native_queued_spin_lock_sl
  0.28%    2.27%    0.00%    0x18   16101       10  native_queued_spin_lock_sl
  0.57%    0.00%    0.00%    0x18   33199        5  native_queued_spin_lock_sl
  0.00%    0.00%    1.61%    0x18   10908       32  _raw_spin_lock
  0.00%    0.00%    1.61%    0x18   59770        2  _raw_spin_lock
  0.00%    0.00%    1.61%    0x18       0        1  _raw_spin_unlock
  1.42%    0.00%    0.00%    0x20   12918       20  pull_rt_task
  0.85%    0.00%   25.81%    0x24   31123      199  pull_rt_task
  0.85%    0.00%    3.23%    0x24   38218       24  pull_rt_task
  0.57%    4.55%   19.35%    0x28   30558      207  pull_rt_task
  0.28%    0.00%    0.00%    0x28   55504       10  pull_rt_task
 18.70%   18.18%    0.00%    0x30   26438      291  dequeue_pushable_task
 17.28%   22.73%    0.00%    0x30   29347      281  enqueue_pushable_task
  1.70%    2.27%    0.00%    0x30   12819       31  enqueue_pushable_task
  0.28%    0.00%    0.00%    0x30   17726       18  dequeue_pushable_task
 34.56%   29.55%    0.00%    0x38   25509      527  cpupri_find_fitness
 13.88%   11.36%   24.19%    0x38   30654      342  cpupri_set
  3.12%    2.27%    0.00%    0x38   18093       39  cpupri_set
  1.70%    0.00%    0.00%    0x38   37661       52  cpupri_find_fitness
  1.42%    2.27%   19.35%    0x38   31110      211  cpupri_set
  1.42%    0.00%    1.61%    0x38   45035       31  cpupri_set

3. Perf c2c report of cpupri's last cache line:

-------  -------  -------  ------  ------  -------  --------------------
    Rmt      Lcl    Store    Data    Load    Total
  Hitm%    Hitm%  L1 Hit%  offset  cycles  records  Symbol
-------  -------  -------  ------  ------  -------  --------------------
    149       43       41  0xff14d42c400e3ec0
-------  -------  -------  ------  ------  -------  --------------------
  8.72%   11.63%    0.00%     0x8    2001      165  cpupri_find_fitness
  1.34%    2.33%    0.00%    0x18    1456      151  cpupri_find_fitness
  8.72%    9.30%   58.54%    0x28    1744      263  cpupri_set
  2.01%    4.65%   41.46%    0x28    1958      301  cpupri_set
  1.34%    0.00%    0.00%    0x28   10580        6  cpupri_set
 69.80%   67.44%    0.00%    0x30    1754      347  cpupri_set
  8.05%    4.65%    0.00%    0x30    2144      256  cpupri_set

Signed-off-by: Pan Deng <pan.deng@intel.com>
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Note: The side effect of this change is that struct cpupri size is
increased from 26 cache lines to 203 cache lines.

An alternative implementation of this patch could be separating `counts`
and `masks` into 2 vectors in cpupri_vec (counts[] and masks[]), and
add two paddings:
1. Between counts[0] and counts[1], since counts[0] is more frequently
   updated than others.
2. Between the two vectors, since counts[] is read-write access while
   masks[] is read access when it stores pointers.

The alternative introduces the complexity of 31+/21- LoC changes,
it achieves almost the same performance, at the same time, struct cpupri
size is reduced from 26 cache lines to 21 cache lines.
---
 kernel/sched/cpupri.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
 
 struct cpupri_vec {
 	atomic_t		count;
-	cpumask_var_t		mask;
+	cpumask_var_t		mask ____cacheline_aligned;
 };
 
 struct cpupri {
-- 
2.43.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2025-07-21  6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
@ 2026-03-20 10:09   ` Peter Zijlstra
  2026-03-24  9:36     ` Deng, Pan
  2026-04-08 10:16   ` Chen, Yu C
  1 sibling, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20 10:09 UTC (permalink / raw)
To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Mon, Jul 21, 2025 at 02:10:23PM +0800, Pan Deng wrote:
> When running a multi-instance FFmpeg workload on an HCC system, significant
> cache line contention is observed around `cpupri_vec->count` and `mask` in
> struct root_domain.
>
> The SUT is a 2-socket machine with 240 physical cores and 480 logical
> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS is used as score.
>
> perf c2c tool reveals:
> root_domain cache line 3:
> - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
>   and contends with other fields, since counts[0] is more frequently
>   updated than others along with a rt task enqueues an empty runq or
>   dequeues from a non-overloaded runq.
> - cycles per load: ~10K to 59K
>
> cpupri's last cache line:
> - `cpupri_vec->count` and `mask` contends. The transcoding threads use
>   rt pri 99, so that the contention occurs in the end.
> - cycles per load: ~1.5K to 10.5K
>
> This change mitigates `cpupri_vec->count`, `mask` related contentions by
> separating each count and mask into different cache lines.

Right.

> Note: The side effect of this change is that struct cpupri size is
> increased from 26 cache lines to 203 cache lines.

That is pretty horrible, but probably unavoidable.

> An alternative implementation of this patch could be separating `counts`
> and `masks` into 2 vectors in cpupri_vec (counts[] and masks[]), and
> add two paddings:
> 1. Between counts[0] and counts[1], since counts[0] is more frequently
>    updated than others.

That is completely workload specific; it is a direct consequence of your
(probably busted) priority assignment scheme.

> 2. Between the two vectors, since counts[] is read-write access while
>    masks[] is read access when it stores pointers.
>
> The alternative introduces the complexity of 31+/21- LoC changes,
> it achieves almost the same performance, at the same time, struct cpupri
> size is reduced from 26 cache lines to 21 cache lines.

That is not an alternative, since it very specifically only deals with
fifo-99 contention.

> ---
>  kernel/sched/cpupri.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..245b0fa626be 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -9,7 +9,7 @@
>
>  struct cpupri_vec {
>  	atomic_t		count;
> -	cpumask_var_t		mask;
> +	cpumask_var_t		mask ____cacheline_aligned;
>  };

At the very least this needs a comment, explaining the what and how of
it.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2026-03-20 10:09   ` Peter Zijlstra
@ 2026-03-24  9:36     ` Deng, Pan
  2026-03-24 12:11       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Deng, Pan @ 2026-03-24 9:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo@kernel.org, rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	Li, Tianyou, tim.c.chen@linux.intel.com, Chen, Yu C

> On Mon, Jul 21, 2025 at 02:10:23PM +0800, Pan Deng wrote:
> > When running a multi-instance FFmpeg workload on an HCC system,
> > significant cache line contention is observed around
> > `cpupri_vec->count` and `mask` in struct root_domain.
> >
> > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > with FIFO scheduling. FPS is used as score.
> >
> > perf c2c tool reveals:
> > root_domain cache line 3:
> > - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
> >   and contends with other fields, since counts[0] is more frequently
> >   updated than others along with a rt task enqueues an empty runq or
> >   dequeues from a non-overloaded runq.
> > - cycles per load: ~10K to 59K
> >
> > cpupri's last cache line:
> > - `cpupri_vec->count` and `mask` contends. The transcoding threads use
> >   rt pri 99, so that the contention occurs in the end.
> > - cycles per load: ~1.5K to 10.5K
> >
> > This change mitigates `cpupri_vec->count`, `mask` related contentions by
> > separating each count and mask into different cache lines.
>
> Right.
>
> > Note: The side effect of this change is that struct cpupri size is
> > increased from 26 cache lines to 203 cache lines.
>
> That is pretty horrible, but probably unavoidable.
>
> > An alternative implementation of this patch could be separating `counts`
> > and `masks` into 2 vectors in cpupri_vec (counts[] and masks[]), and
> > add two paddings:
> > 1. Between counts[0] and counts[1], since counts[0] is more frequently
> >    updated than others.
>
> That is completely workload specific; it is a direct consequence of your
> (probably busted) priority assignment scheme.
>
> > 2. Between the two vectors, since counts[] is read-write access while
> >    masks[] is read access when it stores pointers.
> >
> > The alternative introduces the complexity of 31+/21- LoC changes,
> > it achieves almost the same performance, at the same time, struct cpupri
> > size is reduced from 26 cache lines to 21 cache lines.
>
> That is not an alternative, since it very specifically only deals with
> fifo-99 contention.
>
> > ---
> > kernel/sched/cpupri.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> > index d6cba0020064..245b0fa626be 100644
> > --- a/kernel/sched/cpupri.h
> > +++ b/kernel/sched/cpupri.h
> > @@ -9,7 +9,7 @@
> >
> > struct cpupri_vec {
> > 	atomic_t count;
> > -	cpumask_var_t mask;
> > +	cpumask_var_t mask ____cacheline_aligned;
> > };
>
> At the very least this needs a comment, explaining the what and how of
> it.

Hi Peter,

Thank you very much for helping look at this patch series. Before digging
into the details, please let me briefly describe the structure of this
patch set. Each patch builds incrementally on the previous ones, with
patch 1 improving performance by 11%, patch 1+2 improving by 12%,
patch 1+2+3 improving by 13%, and patch 1+2+3+4 by 16%.

Since patch 1 gives the most benefit and is simple enough, we are
planning to address the first issue in patch 1 and try to push this patch
first, then address your comments in remained patches.

We'll investigate a more generic method to solve the global contention
issue as you proposed in patch 3 and patch 4, and we are planning to do
that on multi-LLC system as well (Intel and AMD).

Regarding this patch, yes, using cacheline aligned could increase
potential memory usage. After internal discussion, we are thinking of an
alternative method to mitigate the waste of memory usage, that is, using
kmalloc() to allocate count in a different memory space rather than
placing the count and cpumask together in this structure. The rationale
is that, writing to address pointed by the counter and reading the
address from cpumask is isolated in different memory space which could
reduce the ratio of cache false sharing, besides, kmalloc() based on
slub/slab could place the objects in different cache lines to reduce the
cache contention. The drawback of dynamic allocation counter is that, we
have to maintain the life cycle of the counters.

Could you please advise if sticking with current cache_align attribute
method or using kmalloc() is preferred?

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2026-03-24  9:36     ` Deng, Pan
@ 2026-03-24 12:11       ` Peter Zijlstra
  2026-03-27 10:17         ` Deng, Pan
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-24 12:11 UTC (permalink / raw)
To: Deng, Pan
Cc: mingo@kernel.org, rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	Li, Tianyou, tim.c.chen@linux.intel.com, Chen, Yu C

On Tue, Mar 24, 2026 at 09:36:14AM +0000, Deng, Pan wrote:

> Regarding this patch, yes, using cacheline aligned could increase potential
> memory usage.
> After internal discussion, we are thinking of an alternative method to
> mitigate the waste of memory usage, that is, using kmalloc() to allocate
> count in a different memory space rather than placing the count and
> cpumask together in this structure. The rationale is that, writing to
> address pointed by the counter and reading the address from cpumask
> is isolated in different memory space which could reduce the ratio of
> cache false sharing, besides, kmalloc() based on slub/slab could place
> the objects in different cache lines to reduce the cache contention.
> The drawback of dynamic allocation counter is that, we have to maintain
> the life cycle of the counters.
> Could you please advise if sticking with current cache_align attribute
> method or using kmalloc() is preferred?

Well, you'd have to allocate a full cacheline anyway. If you allocate N
4 byte (counter) objects, there's a fair chance they end up in the same
cacheline (its a SLAB after all) and then you're back to having a ton of
false sharing.

Anyway, for you specific workload, why isn't partitioning a viable
solution? It would not need any kernel modifications and would get rid
of the contention entirely.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
  2026-03-24 12:11       ` Peter Zijlstra
@ 2026-03-27 10:17         ` Deng, Pan
  2026-04-02 10:37           ` Deng, Pan
  2026-04-02 10:43           ` Peter Zijlstra
  0 siblings, 2 replies; 41+ messages in thread
From: Deng, Pan @ 2026-03-27 10:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo@kernel.org, rostedt@goodmis.org, linux-kernel@vger.kernel.org,
	Li, Tianyou, tim.c.chen@linux.intel.com, Chen, Yu C

> On Tue, Mar 24, 2026 at 09:36:14AM +0000, Deng, Pan wrote:
>
> > Regarding this patch, yes, using cacheline aligned could increase potential
> > memory usage.
> > After internal discussion, we are thinking of an alternative method to
> > mitigate the waste of memory usage, that is, using kmalloc() to allocate
> > count in a different memory space rather than placing the count and
> > cpumask together in this structure. The rationale is that, writing to
> > address pointed by the counter and reading the address from cpumask
> > is isolated in different memory space which could reduce the ratio of
> > cache false sharing, besides, kmalloc() based on slub/slab could place
> > the objects in different cache lines to reduce the cache contention.
> > The drawback of dynamic allocation counter is that, we have to maintain
> > the life cycle of the counters.
> > Could you please advise if sticking with current cache_align attribute
> > method or using kmalloc() is preferred?
>
> Well, you'd have to allocate a full cacheline anyway. If you allocate N
> 4 byte (counter) objects, there's a fair chance they end up in the same
> cacheline (its a SLAB after all) and then you're back to having a ton of
> false sharing.
>
> Anyway, for you specific workload, why isn't partitioning a viable
> solution? It would not need any kernel modifications and would get rid
> of the contention entirely.

Thank you very much for pointing this out. We understand cpuset
partitioning would eliminate the contention. However, in managed
container platforms (e.g., Kubernetes), users can obtain RT capabilities
for their workloads via CAP_SYS_NICE, but they don't have host-level
privileges to create cpuset partitions.

Besides the cache line align approach, regarding both the contention and
memory overhead, would it be possible to consider the alternative
approach as follow:

1. Use {counts[], masks[]} instead of vec[{count, mask}]

2. Separate counts[0] (CPUPRI_NORMAL), who experiences both heavy write
   and read traffic.
   Writes: RT task lifecycle operations (enqueue on empty runqueue,
   dequeue from non-overloaded runqueue) frequently update the normal
   priority count.
   Reads: RT tasks searching for available CPUs scan from low to high
   priority, with counts[0] being checked at the start of every search
   iteration.
   So that even if workloads used lower RT priorities (e.g., RT pri 49
   instead of 99), counts[0] contention would still be heavy, not
   specific to the pri-99 workload configuration.

3. Separate masks from counts to ensure no contention between them.

With the change struct cpupri size can be reduced from 26 cache lines to
21 cache lines, which saves more memory in cpuset partitioning scenarios.

code change looks like this:

---
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index 42c40cfdf836..1e333e6edb1e 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -64,16 +64,38 @@ static int convert_prio(int prio)
 	return cpupri;
 }
 
+/*
+ * Get pointer to count for given priority index.
+ *
+ * Skip padding after counts[0] for idx > 0 to access the correct location.
+ */
+static inline atomic_t *cpupri_count(struct cpupri *cp, int idx)
+{
+	if (idx > 0)
+		idx += CPUPRI_COUNT0_PADDING;
+
+	return &cp->pri_to_cpu.counts[idx];
+}
+
+/*
+ * Get pointer to mask for given priority index.
+ */
+static inline cpumask_var_t cpupri_mask(struct cpupri *cp, int idx)
+{
+	return cp->pri_to_cpu.masks[idx];
+}
+
 static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 				struct cpumask *lowest_mask, int idx)
 {
-	struct cpupri_vec *vec = &cp->pri_to_cpu[idx];
+	cpumask_var_t cpu_mask = cpupri_mask(cp, idx);
 	int skip = 0;
 
-	if (!atomic_read(&(vec)->count))
+	if (!atomic_read(cpupri_count(cp, idx)))
 		skip = 1;
 
 	/*
-	 * When looking at the vector, we need to read the counter,
+	 * When looking at the vector, we need to read the count,
 	 * do a memory barrier, then read the mask.
 	 *
 	 * Note: This is still all racy, but we can deal with it.
@@ -96,18 +118,18 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
 	if (skip)
 		return 0;
 
-	if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids)
+	if (cpumask_any_and(&p->cpus_mask, cpu_mask) >= nr_cpu_ids)
 		return 0;
 
 	if (lowest_mask) {
-		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
+		cpumask_and(lowest_mask, &p->cpus_mask, cpu_mask);
 		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
 
 		/*
 		 * We have to ensure that we have at least one bit
 		 * still set in the array, since the map could have
 		 * been concurrently emptied between the first and
-		 * second reads of vec->mask. If we hit this
+		 * second reads of cpu_mask. If we hit this
 		 * condition, simply act as though we never hit this
 		 * priority level and continue on.
 		 */
@@ -227,23 +249,19 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 	 * cpu being missed by the priority loop in cpupri_find.
 	 */
 	if (likely(newpri != CPUPRI_INVALID)) {
-		struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
-
-		cpumask_set_cpu(cpu, vec->mask);
+		cpumask_set_cpu(cpu, cpupri_mask(cp, newpri));
 		/*
 		 * When adding a new vector, we update the mask first,
 		 * do a write memory barrier, and then update the count, to
 		 * make sure the vector is visible when count is set.
 		 */
 		smp_mb__before_atomic();
-		atomic_inc(&(vec)->count);
+		atomic_inc(cpupri_count(cp, newpri));
 		do_mb = 1;
 	}
 	if (likely(oldpri != CPUPRI_INVALID)) {
-		struct cpupri_vec *vec = &cp->pri_to_cpu[oldpri];
-
 		/*
-		 * Because the order of modification of the vec->count
+		 * Because the order of modification of the cpu count
 		 * is important, we must make sure that the update
 		 * of the new prio is seen before we decrement the
 		 * old prio. This makes sure that the loop sees
@@ -252,18 +270,18 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri)
 		 * priority, as that will trigger an rt pull anyway.
 		 *
 		 * We only need to do a memory barrier if we updated
-		 * the new priority vec.
+		 * cpu count or mask of the new priority.
 		 */
 		if (do_mb)
 			smp_mb__after_atomic();
 
 		/*
-		 * When removing from the vector, we decrement the counter first
+		 * When removing from the vector, we decrement the count first
 		 * do a memory barrier and then clear the mask.
 		 */
-		atomic_dec(&(vec)->count);
+		atomic_dec(cpupri_count(cp, oldpri));
 		smp_mb__after_atomic();
-		cpumask_clear_cpu(cpu, vec->mask);
+		cpumask_clear_cpu(cpu, cpupri_mask(cp, oldpri));
 	}
 
 	*currpri = newpri;
@@ -280,10 +298,8 @@ int cpupri_init(struct cpupri *cp)
 	int i;
 
 	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) {
-		struct cpupri_vec *vec = &cp->pri_to_cpu[i];
-
-		atomic_set(&vec->count, 0);
-		if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL))
+		atomic_set(cpupri_count(cp, i), 0);
+		if (!zalloc_cpumask_var(&cp->pri_to_cpu.masks[i], GFP_KERNEL))
 			goto cleanup;
 	}
 
@@ -298,7 +314,7 @@ int cpupri_init(struct cpupri *cp)
 cleanup:
 	for (i--; i >= 0; i--)
-		free_cpumask_var(cp->pri_to_cpu[i].mask);
+		free_cpumask_var(cpupri_mask(cp, i));
 	return -ENOMEM;
 }
 
@@ -312,5 +328,5 @@ void cpupri_cleanup(struct cpupri *cp)
 	kfree(cp->cpu_to_pri);
 
 	for (i = 0; i < CPUPRI_NR_PRIORITIES; i++)
-		free_cpumask_var(cp->pri_to_cpu[i].mask);
+		free_cpumask_var(cpupri_mask(cp, i));
 }
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..9041e2ffb3f3 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -1,5 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 
+#include <linux/cache.h>
+
 #define CPUPRI_NR_PRIORITIES	(MAX_RT_PRIO+1)
 
 #define CPUPRI_INVALID		-1
@@ -7,13 +9,73 @@
 /* values 1-99 are for RT1-RT99 priorities */
 #define CPUPRI_HIGHER		100
 
+/*
+ * Padding to isolate counts[0] (CPUPRI_NORMAL) into its own cacheline.
+ *
+ * On 64-byte cacheline systems: (64 / 4) - 1 = 15 padding slots
+ * This places counts[0] alone in cacheline 0, counts[1..N] in subsequent lines.
+ */
+#define CPUPRI_COUNT0_PADDING ((SMP_CACHE_BYTES / sizeof(atomic_t)) - 1)
+
+/* Total count array size including padding after counts[0] */
+#define CPUPRI_COUNT_ARRAY_SIZE (CPUPRI_NR_PRIORITIES + CPUPRI_COUNT0_PADDING)
+
+/*
+ * Padding bytes to align mask vector to cacheline boundary.
+ * Ensures no false sharing between counts[] and masks[].
+ */
+#define CPUPRI_VEC_PADDING \
+	(SMP_CACHE_BYTES - \
+	 (CPUPRI_COUNT_ARRAY_SIZE * sizeof(atomic_t) % SMP_CACHE_BYTES))
+
 struct cpupri_vec {
-	atomic_t		count;
-	cpumask_var_t		mask;
+	/*
+	 * Count vector with strategic padding to prevent false sharing.
+	 *
+	 * Layout (64-byte cachelines):
+	 *   Cacheline 0:  counts[0] (CPUPRI_NORMAL) + 60 bytes padding
+	 *   Cacheline 1+: counts[1..100] (RT priorities 1-99, CPUPRI_HIGHER)
+	 *
+	 * counts[0] experiences the heaviest read and write traffic:
+	 * - Write: RT task lifecycle operations (enqueue on empty runqueue,
+	 *   dequeue from non-overloaded runqueue) frequently update the
+	 *   normal priority count.
+	 * - Read: RT tasks searching for available CPUs scan from low to high
+	 *   priority, with counts[0] being checked at the start of every
+	 *   search iteration.
+	 * Isolating counts[0] in its own cacheline prevents contention with
+	 * other priority counts during concurrent search and update
+	 * operations.
+	 */
+	atomic_t counts[CPUPRI_COUNT_ARRAY_SIZE];
+
+	/*
+	 * Padding to separate count and mask vectors.
+	 *
+	 * Prevents false sharing between:
+	 * - counts[] (read-write, hot path in cpupri_set)
+	 * - masks[] (read-mostly, accessed in cpupri_find)
+	 */
+	char padding[CPUPRI_VEC_PADDING];
+
+	/*
+	 * CPU mask vector.
+	 *
+	 * Either stores:
+	 * - Pointers to dynamically allocated cpumasks (read-mostly after init)
+	 * - Inline cpumasks (if !CPUMASK_OFFSTACK)
+	 */
+	cpumask_var_t masks[CPUPRI_NR_PRIORITIES];
 };
 
 struct cpupri {
-	struct cpupri_vec	pri_to_cpu[CPUPRI_NR_PRIORITIES];
+	/*
+	 * Priority-to-CPU mapping.
+	 *
+	 * Single cpupri_vec structure containing all counts and masks,
+	 * rather than 101 separate cpupri_vec elements. This reduces
+	 * memory overhead from ~26 to ~21 cachelines.
+	 */
+	struct cpupri_vec	pri_to_cpu;
 	int			*cpu_to_pri;
 };
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..2263237cdeb0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1014,7 +1014,7 @@ struct root_domain {
 	 * one runnable RT task.
 	 */
 	cpumask_var_t		rto_mask;
-	struct cpupri		cpupri;
+	struct cpupri		cpupri ____cacheline_aligned;
 
 	/*
 	 * NULL-terminated list of performance domains intersecting with the

^ permalink raw reply related	[flat|nested] 41+ messages in thread
* RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention 2026-03-27 10:17 ` Deng, Pan @ 2026-04-02 10:37 ` Deng, Pan 2026-04-02 10:43 ` Peter Zijlstra 1 sibling, 0 replies; 41+ messages in thread From: Deng, Pan @ 2026-04-02 10:37 UTC (permalink / raw) To: Peter Zijlstra, Steven Rostedt Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, Li, Tianyou, tim.c.chen@linux.intel.com, Chen, Yu C > + atomic_t counts[CPUPRI_COUNT_ARRAY_SIZE]; > + > + /* > + * Padding to separate count and mask vectors. > + * > + * Prevents false sharing between: > + * - counts[] (read-write, hot path in cpupri_set) > + * - masks[] (read-mostly, accessed in cpupri_find) > + */ > + char padding[CPUPRI_VEC_PADDING]; > + > + /* > + * CPU mask vector. > + * > + * Either stores: > + * - Pointers to dynamically allocated cpumasks (read-mostly after init) > + * - Inline cpumasks (if !CPUMASK_OFFSTACK) > + */ > + cpumask_var_t masks[CPUPRI_NR_PRIORITIES]; > }; > > struct cpupri { > - struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES]; > + /* > + * Priority-to-CPU mapping. > + * > + * Single cpupri_vec structure containing all counts and masks, > + * rather than 101 separate cpupri_vec elements. This reduces > + * memory overhead from ~26 to ~21 cachelines. > + */ > + struct cpupri_vec pri_to_cpu; > int *cpu_to_pri; > }; > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 475bb5998295..2263237cdeb0 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -1014,7 +1014,7 @@ struct root_domain { > * one runnable RT task. > */ > cpumask_var_t rto_mask; > - struct cpupri cpupri; > + struct cpupri cpupri ____cacheline_aligned; > > /* > * NULL-terminated list of performance domains intersecting with the Peter and Steven, Here we consider two approaches: The cache-line alignment approach is simple to implement but increases memory usage. 
The alternative approach (separating counts and masks, with padding after counts[0]) reduces memory footprint at the cost of slightly higher complexity. What is your opinion? thanks a lot! ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention 2026-03-27 10:17 ` Deng, Pan 2026-04-02 10:37 ` Deng, Pan @ 2026-04-02 10:43 ` Peter Zijlstra 1 sibling, 0 replies; 41+ messages in thread From: Peter Zijlstra @ 2026-04-02 10:43 UTC (permalink / raw) To: Deng, Pan Cc: mingo@kernel.org, rostedt@goodmis.org, linux-kernel@vger.kernel.org, Li, Tianyou, tim.c.chen@linux.intel.com, Chen, Yu C On Fri, Mar 27, 2026 at 10:17:13AM +0000, Deng, Pan wrote: > > > > On Tue, Mar 24, 2026 at 09:36:14AM +0000, Deng, Pan wrote: > > > > > Regarding this patch, yes, using cacheline aligned could increase potential > > > memory usage. > > > After internal discussion, we are thinking of an alternative method to > > > mitigate the waste of memory usage, that is, using kmalloc() to allocate > > > count in a different memory space rather than placing the count and > > > cpumask together in this structure. The rationale is that, writing to > > > address pointed by the counter and reading the address from cpumask > > > is isolated in different memory space which could reduce the ratio of > > > cache false sharing, besides, kmalloc() based on slub/slab could place > > > the objects in different cache lines to reduce the cache contention. > > > The drawback of dynamic allocation counter is that, we have to maintain > > > the life cycle of the counters. > > > Could you please advise if sticking with current cache_align attribute > > > method or using kmalloc() is preferred? > > > > Well, you'd have to allocate a full cacheline anyway. If you allocate N > > 4 byte (counter) objects, there's a fair chance they end up in the same > > cacheline (its a SLAB after all) and then you're back to having a ton of > > false sharing. > > > > Anyway, for you specific workload, why isn't partitioning a viable > > solution? It would not need any kernel modifications and would get rid > > of the contention entirely. > > Thank you very much for pointing this out. 
> > We understand cpuset partitioning would eliminate the contention. > However, in managed container platforms (e.g., Kubernetes), users can > obtain RT capabilities for their workloads via CAP_SYS_NICE, but they > don't have host-level privileges to create cpuset partitions. So because Kubernetes is shit, you're going to patch the kernel? Isn't that backwards? Should you not instead try and fix this kubernetes thing? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention 2025-07-21 6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng 2026-03-20 10:09 ` Peter Zijlstra @ 2026-04-08 10:16 ` Chen, Yu C 2026-04-09 11:47 ` Deng, Pan 1 sibling, 1 reply; 41+ messages in thread From: Chen, Yu C @ 2026-04-08 10:16 UTC (permalink / raw) To: Pan Deng; +Cc: linux-kernel, tianyou.li, tim.c.chen, peterz, mingo On 7/21/2025 2:10 PM, Pan Deng wrote: > When running a multi-instance FFmpeg workload on an HCC system, significant > cache line contention is observed around `cpupri_vec->count` and `mask` in > struct root_domain. > > The SUT is a 2-socket machine with 240 physical cores and 480 logical > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99 > with FIFO scheduling. FPS is used as score. > [ ... ] > As a result: > - FPS improves by ~11% > - Kernel cycles% drops from ~20% to ~11% > - `count` and `mask` related cache line contention is mitigated, perf c2c > shows root_domain cache line 3 `cycles per load` drops from ~10K-59K > to ~0.5K-8K, cpupri's last cache line no longer appears in the report. > - stress-ng cyclic benchmark is improved ~31.4%, command: > stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \ > --timeout 30 --minimize --metrics > - rt-tests/pi_stress is improved ~76.5%, command: > rt-tests/pi_stress -D 30 -g $(($(nproc) / 2)) > According to your test results above, this original proposal seems simple enough. It provides a general benefit, not only for FFmpeg workloads with "unusual" CPU affinity settings, but also for other common workloads that do not use CPU affinity or partitioning. I still prefer this proposal. Later we can rebase patch 4 on top of sbm to see if it brings further improvements. patch 1 and patch 4 could form a patch series IMHO. 
thanks, Chenyu > diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h > index d6cba0020064..245b0fa626be 100644 > --- a/kernel/sched/cpupri.h > +++ b/kernel/sched/cpupri.h > @@ -9,7 +9,7 @@ > > struct cpupri_vec { > atomic_t count; > - cpumask_var_t mask; > + cpumask_var_t mask ____cacheline_aligned; > }; > > struct cpupri { ^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention 2026-04-08 10:16 ` Chen, Yu C @ 2026-04-09 11:47 ` Deng, Pan 0 siblings, 0 replies; 41+ messages in thread From: Deng, Pan @ 2026-04-09 11:47 UTC (permalink / raw) To: Chen, Yu C, peterz@infradead.org, Steven Rostedt Cc: linux-kernel@vger.kernel.org, Li, Tianyou, tim.c.chen@linux.intel.com, mingo@kernel.org > According to your test results above, this original proposal seems > simple enough. It provides a general benefit, not only for FFmpeg workloads > with "unusual" CPU affinity settings, but also for other common workloads > that do not use CPU affinity or partitioning. Yes, exactly. FFmpeg and K8s are just example scenarios - the optimization benefits any workload with RT thread contention. For instance, running cyclictest on a 2-socket, 384-logical-core system: "cyclictest -t -i200 -h 32 -m -p 95 -q" This patch reduces both mean and max latency by at least 40%. > I still prefer this proposal. Later we can rebase patch 4 on top of sbm > to see if it brings further improvements. patch 1 and patch 4 could form a > patch series IMHO. Thank you for the feedback. I agree that patch 1 and patch 4 work well together. Regarding the sbm discussion: we've observed promising results in our sbm experiments, and I believe rebasing patch 4 on top of sbm would likely show further improvements beyond the per-NUMA implementation. I'll try this once the sbm implementation stabilizes. Per Peter's previous request, I'm planning to add comments like this: /* * Separate mask to a different cacheline to mitigate contention * between count (read-write) and mask (read-mostly when storing * pointers). This alignment increases root_domain size by ~11KB, * but eliminates cache line bouncing between cpupri_set() writers * and cpupri_find_fitness() readers under heavy RT workloads. 
* * Memory overhead considerations: * - Systems with cpuset partitions: each partition's root_domain is * dynamically allocated (kalloc). The ~11KB overhead per partition * scales with partition count, acceptable on servers using partitions. * - Systems without partitions: only the static def_root_domain incurs * the overhead, which is manageable for typical use. * * Additionally, this cacheline alignment ensures cpupri starts at a * cacheline boundary, eliminating false sharing with root_domain's * preceding fields (rto_mask, rto_loop_next, rto_loop_start). */ cpumask_var_t mask ____cacheline_aligned_in_smp; Since this optimization is independent of the sbm work, would it be possible to review this patch first? That would allow the sbm-related improvements (patch 4) to build on top of this foundation once they're ready. Best Regards Pan ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention 2025-07-21 6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng 2025-07-21 6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng @ 2025-07-21 6:10 ` Pan Deng 2026-03-20 10:18 ` Peter Zijlstra 2025-07-21 6:10 ` [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng ` (2 subsequent siblings) 4 siblings, 1 reply; 41+ messages in thread From: Pan Deng @ 2025-07-21 6:10 UTC (permalink / raw) To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng When running a multi-instance FFmpeg workload on HCC system, significant contention is observed in root_domain cacheline 1 and 3. The SUT is a 2-socket machine with 240 physical cores and 480 logical CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99 with FIFO scheduling. FPS is used as score. perf c2c tool reveals (sorted by contention severity): root_domain cache line 3: - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored, since counts[0] is more frequently updated than others along with a rt task enqueues an empty runq or dequeues from a non-overloaded runq. - `rto_mask` (0x30) is heavily loaded - `rto_loop_next` (0x24) and `rto_loop_start` (0x28) are frequently stored - `rto_push_work` (0x0) and `rto_lock` (0x18) are lightly accessed - cycles per load: ~10K to 59K root_domain cache line 1: - `rto_count` (0x4) is frequently loaded/stored - `overloaded` (0x28) is heavily loaded - cycles per load: ~2.8K to 44K: This change adjusts the layout of `root_domain` to isolate these contended fields across separate cache lines: 1. `rto_count` remains in the 1st cache line; `overloaded` and `overutilized` are moved to the last cache line 2. `rto_push_work` is placed in the 2nd cache line 3. 
`rto_loop_start`, `rto_loop_next`, and `rto_lock` remain in the 3rd cache line; `rto_mask` is moved near `pd` in the penultimate cache line 4. `cpupri` starts at the 4th cache line to prevent `pri_to_cpu[0].count` contending with fields in cache line 3. With this change: - FPS improves by ~5% - Kernel cycles% drops from ~20% to ~17.7% - root_domain cache line 3 no longer appears in perf-c2c report - cycles per load of root_domain cache line 1 is reduced from ~2.8K-44K to ~2.1K-2.7K - stress-ng cyclic benchmark is improved ~18.6%, command: stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \ --timeout 30 --minimize --metrics - rt-tests/pi_stress is improved ~4.7%, command: rt-tests/pi_stress -D 30 -g $(($(nproc) / 2)) According to the nature of the change, to my understanding, it doesn't introduce any negative impact in other scenarios. Note: This change increases the size of `root_domain` from 29 to 31 cache lines, it's considered acceptable since `root_domain` is a single global object. Appendix: 1. 
Current layout of contended data structure: struct root_domain { atomic_t refcount; /* 0 4 */ atomic_t rto_count; /* 4 4 */ struct callback_head rcu __attribute__((__aligned__(8)));/*8 16 */ cpumask_var_t span; /* 24 8 */ cpumask_var_t online; /* 32 8 */ bool overloaded; /* 40 1 */ bool overutilized; /* 41 1 */ /* XXX 6 bytes hole, try to pack */ cpumask_var_t dlo_mask; /* 48 8 */ atomic_t dlo_count; /* 56 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dl_bw dl_bw; /* 64 24 */ struct cpudl cpudl; /* 88 24 */ u64 visit_gen; /* 112 8 */ struct irq_work rto_push_work; /* 120 32 */ /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */ raw_spinlock_t rto_lock; /* 152 4 */ int rto_loop; /* 156 4 */ int rto_cpu; /* 160 4 */ atomic_t rto_loop_next; /* 164 4 */ atomic_t rto_loop_start; /* 168 4 */ /* XXX 4 bytes hole, try to pack */ cpumask_var_t rto_mask; /* 176 8 */ /* --- cacheline 3 boundary (192 bytes) was 8 bytes hence --- */ struct cpupri cpupri; /* 184 1624 */ ... } __attribute__((__aligned__(8))); 2. 
Perf c2c report of root_domain cache line 3: ------- ------- ------ ------ ------ ------ ------------------------ Rmt Lcl Store Data Load Total Symbol Hitm% Hitm% L1 Hit% offset cycles records ------- ------- ------ ------ ------ ------ ------------------------ 353 44 62 0xff14d42c400e3880 ------- ------- ------ ------ ------ ------ ------------------------ 0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_ 0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_ 0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on 0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single 0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on 0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl 0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl 0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl 0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock 0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock 0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock 1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task 0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task 0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task 0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task 0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task 18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task 17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task 1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task 0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task 34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness 13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set 3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set 1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness 1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set 1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set 3. 
Perf c2c report of root_domain cache line 1: ------- ------- ------ ------ ------ ------ ------------------------ Rmt Lcl Store Data Load Total Symbol Hitm% Hitm% L1 Hit% offset cycles records ------- ------- ------ ------ ------ ------ ------------------------ 231 43 48 0xff14d42c400e3800 ------- ------- ------ ------ ------ ------ ------------------------ 22.51% 18.60% 0.00% 0x4 5041 247 pull_rt_task 5.63% 2.33% 45.83% 0x4 6995 315 dequeue_pushable_task 3.90% 4.65% 54.17% 0x4 6587 370 enqueue_pushable_task 0.43% 0.00% 0.00% 0x4 17111 4 enqueue_pushable_task 0.43% 0.00% 0.00% 0x4 44062 4 dequeue_pushable_task 32.03% 27.91% 0.00% 0x28 6393 285 enqueue_task_rt 16.45% 27.91% 0.00% 0x28 5534 139 sched_balance_newidle 14.72% 18.60% 0.00% 0x28 5287 110 dequeue_task_rt 3.46% 0.00% 0.00% 0x28 2820 25 enqueue_task_fair 0.43% 0.00% 0.00% 0x28 220 3 enqueue_task_stop Signed-off-by: Pan Deng <pan.deng@intel.com> Reviewed-by: Tianyou Li <tianyou.li@intel.com> Reviewed-by: Chen Yu <yu.c.chen@intel.com> --- kernel/sched/sched.h | 52 +++++++++++++++++++++++--------------------- 1 file changed, 27 insertions(+), 25 deletions(-) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 83e3aa917142..bc67806911f2 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -968,24 +968,29 @@ struct root_domain { cpumask_var_t span; cpumask_var_t online; + atomic_t dlo_count; + struct dl_bw dl_bw; + struct cpudl cpudl; + +#ifdef HAVE_RT_PUSH_IPI /* - * Indicate pullable load on at least one CPU, e.g: - * - More than one runnable task - * - Running task is misfit + * For IPI pull requests, loop across the rto_mask. 
*/ - bool overloaded; - - /* Indicate one or more CPUs over-utilized (tipping point) */ - bool overutilized; + struct irq_work rto_push_work; + raw_spinlock_t rto_lock; + /* These are only updated and read within rto_lock */ + int rto_loop; + int rto_cpu; + /* These atomics are updated outside of a lock */ + atomic_t rto_loop_next; + atomic_t rto_loop_start; +#endif /* * The bit corresponding to a CPU gets set here if such CPU has more * than one runnable -deadline task (as it is below for RT tasks). */ cpumask_var_t dlo_mask; - atomic_t dlo_count; - struct dl_bw dl_bw; - struct cpudl cpudl; /* * Indicate whether a root_domain's dl_bw has been checked or @@ -995,32 +1000,29 @@ struct root_domain { * that u64 is 'big enough'. So that shouldn't be a concern. */ u64 visit_cookie; + struct cpupri cpupri ____cacheline_aligned; -#ifdef HAVE_RT_PUSH_IPI /* - * For IPI pull requests, loop across the rto_mask. + * NULL-terminated list of performance domains intersecting with the + * CPUs of the rd. Protected by RCU. */ - struct irq_work rto_push_work; - raw_spinlock_t rto_lock; - /* These are only updated and read within rto_lock */ - int rto_loop; - int rto_cpu; - /* These atomics are updated outside of a lock */ - atomic_t rto_loop_next; - atomic_t rto_loop_start; -#endif + struct perf_domain __rcu *pd ____cacheline_aligned; + /* * The "RT overload" flag: it gets set if a CPU has more than * one runnable RT task. */ cpumask_var_t rto_mask; - struct cpupri cpupri; /* - * NULL-terminated list of performance domains intersecting with the - * CPUs of the rd. Protected by RCU. + * Indicate pullable load on at least one CPU, e.g: + * - More than one runnable task + * - Running task is misfit */ - struct perf_domain __rcu *pd; + bool overloaded ____cacheline_aligned; + + /* Indicate one or more CPUs over-utilized (tipping point) */ + bool overutilized; }; extern void init_defrootdomain(void); -- 2.43.5 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention 2025-07-21 6:10 ` [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng @ 2026-03-20 10:18 ` Peter Zijlstra 0 siblings, 0 replies; 41+ messages in thread From: Peter Zijlstra @ 2026-03-20 10:18 UTC (permalink / raw) To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen On Mon, Jul 21, 2025 at 02:10:24PM +0800, Pan Deng wrote: > When running a multi-instance FFmpeg workload on HCC system, significant > contention is observed in root_domain cacheline 1 and 3. What's a HCC? Hobby Computer Club? Google is telling me it is the most prevalent form of liver cancer, but I somehow doubt that is what you're on about. > The SUT is a 2-socket machine with 240 physical cores and 480 logical Satellite User Terminal? Subsea Umbilical Termination? Small Unit Transceiver? Single Unit Test? > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99 > with FIFO scheduling. FPS is used as score. Yes yes, poorly configured systems hurt. > perf c2c tool reveals (sorted by contention severity): > root_domain cache line 3: > - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored, > since counts[0] is more frequently updated than others along with a > rt task enqueues an empty runq or dequeues from a non-overloaded runq. > - `rto_mask` (0x30) is heavily loaded > - `rto_loop_next` (0x24) and `rto_loop_start` (0x28) are frequently stored > - `rto_push_work` (0x0) and `rto_lock` (0x18) are lightly accessed > - cycles per load: ~10K to 59K > > root_domain cache line 1: > - `rto_count` (0x4) is frequently loaded/stored > - `overloaded` (0x28) is heavily loaded > - cycles per load: ~2.8K to 44K: > > This change adjusts the layout of `root_domain` to isolate these contended > fields across separate cache lines: > 1. 
`rto_count` remains in the 1st cache line; `overloaded` and > `overutilized` are moved to the last cache line > 2. `rto_push_work` is placed in the 2nd cache line > 3. `rto_loop_start`, `rto_loop_next`, and `rto_lock` remain in the 3rd > cache line; `rto_mask` is moved near `pd` in the penultimate cache line > 4. `cpupri` starts at the 4th cache line to prevent `pri_to_cpu[0].count` > contending with fields in cache line 3. > > With this change: > - FPS improves by ~5% > - Kernel cycles% drops from ~20% to ~17.7% > - root_domain cache line 3 no longer appears in perf-c2c report > - cycles per load of root_domain cache line 1 is reduced to from > ~2.8K-44K to ~2.1K-2.7K > - stress-ng cyclic benchmark is improved ~18.6%, command: > stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \ > --timeout 30 --minimize --metrics > - rt-tests/pi_stress is improved ~4.7%, command: > rt-tests/pi_stress -D 30 -g $(($(nproc) / 2)) > > According to the nature of the change, to my understanding, it doesn`t > introduce any negative impact in other scenario. > > Note: This change increases the size of `root_domain` from 29 to 31 cache > lines, it's considered acceptable since `root_domain` is a single global > object. Uhm, what? We're at 207 cachelines due to that previous patch, remember? A few more don't matter at this point I would guess. It doesn't actually apply anymore, but it needs the very same that previous patch did -- more comments. ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters 2025-07-21 6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng 2025-07-21 6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng 2025-07-21 6:10 ` [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng @ 2025-07-21 6:10 ` Pan Deng 2026-03-20 10:24 ` Peter Zijlstra 2025-07-21 6:10 ` [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng 2026-03-20 9:59 ` [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Peter Zijlstra 4 siblings, 1 reply; 41+ messages in thread From: Pan Deng @ 2025-07-21 6:10 UTC (permalink / raw) To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng When running a multi-instance FFmpeg workload on HCC system, significant contention is observed on root_domain `rto_count` and `overloaded` fields. The SUT is a 2-socket machine with 240 physical cores and 480 logical CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99 with FIFO scheduling. FPS is used as score. perf c2c tool reveals: root_domain cache line 1: - `rto_count` (0x4) is frequently loaded/stored - `overloaded` (0x28) is heavily loaded - cycles per load: ~2.8K to 44K. A separate patch rearranges root_domain to place `overloaded` on a different cache line, but this alone is insufficient to resolve the contention on `rto_count`. As a complement, this patch splits `rto_count` into per-NUMA-node counters to reduce the contention. 
With this change: - FPS improves by ~4% - Kernel cycles% drops from ~20% to ~18.6% - The cache line no longer appears in perf-c2c report - stress-ng cyclic benchmark is improved ~50.4%, command: stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \ --timeout 30 --minimize --metrics - rt-tests/pi_stress is improved ~5.1%, command: rt-tests/pi_stress -D 30 -g $(($(nproc) / 2)) Appendix: 1. Perf c2c report of root_domain cache line 1: ------- ------- ------ ------ ------ ------ ------------------------ Rmt Lcl Store Data Load Total Symbol Hitm% Hitm% L1 Hit% offset cycles records ------- ------- ------ ------ ------ ------ ------------------------ 231 43 48 0xff14d42c400e3800 ------- ------- ------ ------ ------ ------ ------------------------ 22.51% 18.60% 0.00% 0x4 5041 247 pull_rt_task 5.63% 2.33% 45.83% 0x4 6995 315 dequeue_pushable_task 3.90% 4.65% 54.17% 0x4 6587 370 enqueue_pushable_task 0.43% 0.00% 0.00% 0x4 17111 4 enqueue_pushable_task 0.43% 0.00% 0.00% 0x4 44062 4 dequeue_pushable_task 32.03% 27.91% 0.00% 0x28 6393 285 enqueue_task_rt 16.45% 27.91% 0.00% 0x28 5534 139 sched_balance_newidle 14.72% 18.60% 0.00% 0x28 5287 110 dequeue_task_rt 3.46% 0.00% 0.00% 0x28 2820 25 enqueue_task_fair 0.43% 0.00% 0.00% 0x28 220 3 enqueue_task_stop Signed-off-by: Pan Deng <pan.deng@intel.com> Reviewed-by: Tianyou Li <tianyou.li@intel.com> Reviewed-by: Chen Yu <yu.c.chen@intel.com> --- V1 -> V2: Fixed non CONFIG_SMP build issue --- kernel/sched/rt.c | 56 ++++++++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 9 ++++++- kernel/sched/topology.c | 7 ++++++ 3 files changed, 68 insertions(+), 4 deletions(-) diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index e40422c37033..cbcfd3aa3439 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -337,9 +337,58 @@ static inline bool need_pull_rt_task(struct rq *rq, struct task_struct *prev) return rq->online && rq->rt.highest_prio.curr > prev->prio; } +int rto_counts_init(atomic_tp **rto_counts) +{ + int i; + 
atomic_tp *counts = kzalloc(nr_node_ids * sizeof(atomic_tp), GFP_KERNEL); + + if (!counts) + return -ENOMEM; + + for (i = 0; i < nr_node_ids; i++) { + counts[i] = kzalloc_node(sizeof(atomic_t), GFP_KERNEL, i); + + if (!counts[i]) + goto cleanup; + } + + *rto_counts = counts; + return 0; + +cleanup: + while (i--) + kfree(counts[i]); + + kfree(counts); + return -ENOMEM; +} + +void rto_counts_cleanup(atomic_tp *rto_counts) +{ + for (int i = 0; i < nr_node_ids; i++) + kfree(rto_counts[i]); + + kfree(rto_counts); +} + static inline int rt_overloaded(struct rq *rq) { - return atomic_read(&rq->rd->rto_count); + int count = 0; + int cur_node, nid; + + cur_node = numa_node_id(); + + for (int i = 0; i < nr_node_ids; i++) { + nid = (cur_node + i) % nr_node_ids; + count += atomic_read(rq->rd->rto_counts[nid]); + + // The caller only checks if it is 0 + // or 1, so that return once > 1 + if (count > 1) + return count; + } + + return count; } static inline void rt_set_overload(struct rq *rq) @@ -358,7 +407,7 @@ static inline void rt_set_overload(struct rq *rq) * Matched by the barrier in pull_rt_task(). 
*/ smp_wmb(); - atomic_inc(&rq->rd->rto_count); + atomic_inc(rq->rd->rto_counts[cpu_to_node(rq->cpu)]); } static inline void rt_clear_overload(struct rq *rq) @@ -367,7 +416,7 @@ static inline void rt_clear_overload(struct rq *rq) return; /* the order here really doesn't matter */ - atomic_dec(&rq->rd->rto_count); + atomic_dec(rq->rd->rto_counts[cpu_to_node(rq->cpu)]); cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask); } @@ -443,6 +492,7 @@ static inline void dequeue_pushable_task(struct rq *rq, struct task_struct *p) static inline void rt_queue_push_tasks(struct rq *rq) { } + #endif /* CONFIG_SMP */ static void enqueue_top_rt_rq(struct rt_rq *rt_rq); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bc67806911f2..13fc3ac3381b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -953,6 +953,8 @@ struct perf_domain { struct rcu_head rcu; }; +typedef atomic_t *atomic_tp; + /* * We add the notion of a root-domain which will be used to define per-domain * variables. Each exclusive cpuset essentially defines an island domain by @@ -963,12 +965,15 @@ struct perf_domain { */ struct root_domain { atomic_t refcount; - atomic_t rto_count; struct rcu_head rcu; cpumask_var_t span; cpumask_var_t online; atomic_t dlo_count; + + /* rto_count per node */ + atomic_tp *rto_counts; + struct dl_bw dl_bw; struct cpudl cpudl; @@ -1030,6 +1035,8 @@ extern int sched_init_domains(const struct cpumask *cpu_map); extern void rq_attach_root(struct rq *rq, struct root_domain *rd); extern void sched_get_rd(struct root_domain *rd); extern void sched_put_rd(struct root_domain *rd); +extern int rto_counts_init(atomic_tp **rto_counts); +extern void rto_counts_cleanup(atomic_tp *rto_counts); static inline int get_rd_overloaded(struct root_domain *rd) { diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index b958fe48e020..166dc8177a44 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -457,6 +457,7 @@ static void free_rootdomain(struct rcu_head *rcu) 
{ struct root_domain *rd = container_of(rcu, struct root_domain, rcu); + rto_counts_cleanup(rd->rto_counts); cpupri_cleanup(&rd->cpupri); cpudl_cleanup(&rd->cpudl); free_cpumask_var(rd->dlo_mask); @@ -549,8 +550,14 @@ static int init_rootdomain(struct root_domain *rd) if (cpupri_init(&rd->cpupri) != 0) goto free_cpudl; + + if (rto_counts_init(&rd->rto_counts) != 0) + goto free_cpupri; + return 0; +free_cpupri: + cpupri_cleanup(&rd->cpupri); free_cpudl: cpudl_cleanup(&rd->cpudl); free_rto_mask: -- 2.43.5 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters 2025-07-21 6:10 ` [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng @ 2026-03-20 10:24 ` Peter Zijlstra 2026-03-23 18:09 ` Tim Chen 0 siblings, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2026-03-20 10:24 UTC (permalink / raw) To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen On Mon, Jul 21, 2025 at 02:10:25PM +0800, Pan Deng wrote: > As a complementary, this patch splits > `rto_count` into per-numa-node counters to reduce the contention. Right... so Tim, didn't we have similar patches for task_group::load_avg or something like that? Whatever did happen there? Can we share common infra? Also since Tim is sitting on this LLC infrastructure, can you compare per-node and per-llc for this stuff? Somehow I'm thinking that a 2 socket 480 CPU system only has like 2 nodes and while splitting this will help some, that might not be excellent. Please test on both Intel and AMD systems, since AMD has more of these LLC things on. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters 2026-03-20 10:24 ` Peter Zijlstra @ 2026-03-23 18:09 ` Tim Chen 2026-03-24 12:16 ` Peter Zijlstra 0 siblings, 1 reply; 41+ messages in thread From: Tim Chen @ 2026-03-23 18:09 UTC (permalink / raw) To: Peter Zijlstra, Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, yu.c.chen On Fri, 2026-03-20 at 11:24 +0100, Peter Zijlstra wrote: > On Mon, Jul 21, 2025 at 02:10:25PM +0800, Pan Deng wrote: > > As a complementary, this patch splits > > `rto_count` into per-numa-node counters to reduce the contention. > > Right... so Tim, didn't we have similar patches for task_group::load_avg > or something like that? Whatever did happen there? Can we share common > infra? We did talk about introducing per NUMA counter for load_avg. We went with limiting the update rate of load_avg to not more than once per msec in commit 1528c661c24b4 to control the cache bounce. > > Also since Tim is sitting on this LLC infrastructure, can you compare > per-node and per-llc for this stuff? Somehow I'm thinking that a 2 > socket 480 CPU system only has like 2 nodes and while splitting this > will help some, that might not be excellent. You mean enhancing the per NUMA counter to per LLC? I think that makes sense to reduce the LLC cache bounce if there are multiple LLCs per NUMA node. Tim > > Please test on both Intel and AMD systems, since AMD has more of these > LLC things on. > > ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
  2026-03-23 18:09     ` Tim Chen
@ 2026-03-24 12:16       ` Peter Zijlstra
  2026-03-24 22:40         ` Tim Chen
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-24 12:16 UTC (permalink / raw)
  To: Tim Chen; +Cc: Pan Deng, mingo, linux-kernel, tianyou.li, yu.c.chen, x86

On Mon, Mar 23, 2026 at 11:09:24AM -0700, Tim Chen wrote:
> On Fri, 2026-03-20 at 11:24 +0100, Peter Zijlstra wrote:
> > On Mon, Jul 21, 2025 at 02:10:25PM +0800, Pan Deng wrote:
> > > As a complement, this patch splits `rto_count` into per-NUMA-node
> > > counters to reduce the contention.
> >
> > Right... so Tim, didn't we have similar patches for task_group::load_avg
> > or something like that? Whatever did happen there? Can we share common
> > infra?
>
> We did talk about introducing a per-NUMA-node counter for load_avg. We
> went with limiting the update rate of load_avg to not more than once per
> msec in commit 1528c661c24b4 to control the cache bounce.
>
> > Also since Tim is sitting on this LLC infrastructure, can you compare
> > per-node and per-llc for this stuff? Somehow I'm thinking that a 2
> > socket 480 CPU system only has like 2 nodes and while splitting this
> > will help some, that might not be excellent.
>
> You mean enhancing the per-NUMA-node counter to per-LLC? I think that
> makes sense to reduce the LLC cache bounce if there are multiple LLCs
> per NUMA node.

Does that system have multiple LLCs? Realistically, it would probably
improve things if we could split these giant stupid LLCs along the same
lines SNC does.

I still have the below terrible hack that I've been using to diagnose
and test all these multi-llc patches/regressions etc. Funnily enough
it's been good enough to actually show some of the issues.
---
Subject: x86/topology: Add parameter to split LLC
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Feb 19 12:11:16 CET 2026

Add a (debug) option to virtually split the LLC, no CAT involved, just
fake topology. Used to test code that depends (either in behaviour or
directly) on there being multiple LLC domains in a node.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 Documentation/admin-guide/kernel-parameters.txt | 12 ++++++++++++
 arch/x86/include/asm/processor.h                |  5 +++++
 arch/x86/kernel/smpboot.c                       | 20 ++++++++++++++++++++
 3 files changed, 37 insertions(+)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -7241,6 +7241,18 @@ Kernel parameters
 			Not specifying this option is equivalent to
 			spec_store_bypass_disable=auto.
 
+	split_llc=
+			[X86,EARLY] Split the LLC N-ways
+
+			When set, the LLC is split this many ways by matching
+			'core_id % n'. This is set up before SMP bringup and
+			used during SMP bringup before it knows the full
+			topology. If your core count doesn't nicely divide by
+			the number given, you get to keep the pieces.
+
+			This is mostly a debug feature to emulate multiple LLCs
+			on hardware that only has a single LLC.
+
 	split_lock_detect=
 			[X86] Enable split lock detection or bus lock detection
 
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -699,6 +699,11 @@ static inline u32 per_cpu_l2c_id(unsigne
 	return per_cpu(cpu_info.topo.l2c_id, cpu);
 }
 
+static inline u32 per_cpu_core_id(unsigned int cpu)
+{
+	return per_cpu(cpu_info.topo.core_id, cpu);
+}
+
 #ifdef CONFIG_CPU_SUP_AMD
 /*
  * Issue a DIV 0/1 insn to clear any division data from previous DIV
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -424,6 +424,21 @@ static const struct x86_cpu_id intel_cod
 	{}
 };
 
+/*
+ * Allows splitting the LLC by matching 'core_id % split_llc'.
+ *
+ * This is mostly a debug hack to emulate systems with multiple LLCs per node
+ * on systems that do not naturally have this.
+ */
+static unsigned int split_llc = 0;
+
+static int __init split_llc_setup(char *str)
+{
+	get_option(&str, &split_llc);
+	return 0;
+}
+early_param("split_llc", split_llc_setup);
+
 static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 {
 	const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
@@ -438,6 +453,11 @@ static bool match_llc(struct cpuinfo_x86
 	if (per_cpu_llc_id(cpu1) != per_cpu_llc_id(cpu2))
 		return false;
 
+	if (split_llc &&
+	    (per_cpu_core_id(cpu1) % split_llc) !=
+	    (per_cpu_core_id(cpu2) % split_llc))
+		return false;
+
 	/*
 	 * Allow the SNC topology without warning. Return of false
 	 * means 'c' does not share the LLC of 'o'. This will be

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters
  2026-03-24 12:16       ` Peter Zijlstra
@ 2026-03-24 22:40         ` Tim Chen
  0 siblings, 0 replies; 41+ messages in thread
From: Tim Chen @ 2026-03-24 22:40 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Pan Deng, mingo, linux-kernel, tianyou.li, yu.c.chen, x86

On Tue, 2026-03-24 at 13:16 +0100, Peter Zijlstra wrote:
> On Mon, Mar 23, 2026 at 11:09:24AM -0700, Tim Chen wrote:
> > On Fri, 2026-03-20 at 11:24 +0100, Peter Zijlstra wrote:
> > > On Mon, Jul 21, 2025 at 02:10:25PM +0800, Pan Deng wrote:
> > > > As a complement, this patch splits `rto_count` into per-NUMA-node
> > > > counters to reduce the contention.
> > >
> > > Right... so Tim, didn't we have similar patches for task_group::load_avg
> > > or something like that? Whatever did happen there? Can we share common
> > > infra?
> >
> > We did talk about introducing a per-NUMA-node counter for load_avg. We
> > went with limiting the update rate of load_avg to not more than once per
> > msec in commit 1528c661c24b4 to control the cache bounce.
> >
> > > Also since Tim is sitting on this LLC infrastructure, can you compare
> > > per-node and per-llc for this stuff? Somehow I'm thinking that a 2
> > > socket 480 CPU system only has like 2 nodes and while splitting this
> > > will help some, that might not be excellent.
> >
> > You mean enhancing the per-NUMA-node counter to per-LLC? I think that
> > makes sense to reduce the LLC cache bounce if there are multiple LLCs
> > per NUMA node.
>
> Does that system have multiple LLCs? Realistically, it would probably
> improve things if we could split these giant stupid LLCs along the same
> lines SNC does.

The system that Pan tested does not have multiple LLCs per node. But
future Intel systems and current AMD systems do. So it makes sense to
start thinking about having a per-LLC count infrastructure. We could
create a per-LLC counter library, kind of like the percpu counter we
already have.
We can leverage the compact LLC id assignment in the cache aware
scheduling patches to allocate arrays indexed by LLC id.

The caveat is that if such an LLC count is used during early boot,
before LLCs are enumerated in the topology code, we may need to do the
accounting in a global count until the per-LLC count gets enumerated and
we know the right size of the LLC array. And we'll also need to check
whether new LLCs come online or go offline and handle things
accordingly.

That sounds reasonable?

Tim

> I still have the below terrible hack that I've been using to diagnose
> and test all these multi-llc patches/regressions etc. Funnily enough
> it's been good enough to actually show some of the issues.
>
> [ split_llc debug patch quoted in full in the previous message; snipped ]

^ permalink raw reply	[flat|nested] 41+ messages in thread
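[Editor's note: Tim's proposed per-LLC counter library with an early-boot global fallback might look roughly like the sketch below. All names (`llc_counter_*`) and the switch-over mechanics are hypothetical; this only illustrates the "global count until LLCs are enumerated" idea from his mail.]

```c
#include <stdatomic.h>
#include <stdlib.h>

struct llc_counter {
	atomic_int early;	/* used before LLC topology is known */
	int nr_llcs;		/* 0 until topology enumeration */
	atomic_int *per_llc;	/* one slot per LLC id after enumeration */
};

static void llc_counter_add(struct llc_counter *c, int llc_id, int v)
{
	if (!c->nr_llcs)	/* early boot: single global count */
		atomic_fetch_add(&c->early, v);
	else
		atomic_fetch_add(&c->per_llc[llc_id], v);
}

/* Called once LLC ids are enumerated; early counts migrate to slot 0. */
static int llc_counter_init(struct llc_counter *c, int nr_llcs)
{
	c->per_llc = calloc(nr_llcs, sizeof(*c->per_llc));
	if (!c->per_llc)
		return -1;
	atomic_store(&c->per_llc[0], atomic_exchange(&c->early, 0));
	c->nr_llcs = nr_llcs;
	return 0;
}

static int llc_counter_sum(struct llc_counter *c)
{
	int sum = atomic_load(&c->early);

	for (int i = 0; i < c->nr_llcs; i++)
		sum += atomic_load(&c->per_llc[i]);
	return sum;
}
```

A real kernel version would also need the hotplug handling Tim mentions, and per-line alignment of the slots so different LLCs never share a cache line.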
* [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
    ` (2 preceding siblings ...)
  2025-07-21  6:10 ` [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
@ 2025-07-21  6:10 ` Pan Deng
  2026-03-20 12:40   ` Peter Zijlstra
  2026-03-20  9:59 ` [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Peter Zijlstra
  4 siblings, 1 reply; 41+ messages in thread
From: Pan Deng @ 2025-07-21  6:10 UTC (permalink / raw)
  To: peterz, mingo; +Cc: linux-kernel, tianyou.li, tim.c.chen, yu.c.chen, pan.deng

When running a multi-instance FFmpeg workload on an HCC system,
significant contention is observed on the bitmap of `cpupri_vec->cpumask`.

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.

The perf c2c tool reveals, for the cpumask (bitmap) cache line of
`cpupri_vec->mask`:
- bits are loaded during cpupri_find
- bits are stored during cpupri_set
- cycles per load: ~2.2K to 8.7K

This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
mitigate false sharing. As a result:
- FPS improves by ~3.8%
- Kernel cycles% drops from ~20% to ~18.7%
- Cache line contention is mitigated; perf c2c shows cycles per load
  drops from ~2.2K-8.7K to ~0.5K-2.2K
- stress-ng cyclic benchmark is improved ~5.9%, command:
  stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
      --timeout 30 --minimize --metrics
- rt-tests/pi_stress is improved ~9.3%, command:
  rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))

Note: CONFIG_CPUMASK_OFFSTACK=n remains unchanged.

Appendix:
1.
Perf c2c report of `cpupri_vec->mask` bitmap cache line:
-------  -------  -------  ------  ------  -------  ------------------------
    Rmt      Lcl    Store    Data    Load    Total  Symbol
  Hitm%    Hitm%  L1 Hit%  offset  cycles  records
-------  -------  -------  ------  ------  -------  ------------------------
    155       39       39          0xff14d52c4682d800
-------  -------  -------  ------  ------  -------  ------------------------
 43.23%   43.59%    0.00%     0x0    3489      415  _find_first_and_bit
  3.23%    5.13%    0.00%     0x0    3478      107  __bitmap_and
  3.23%    0.00%    0.00%     0x0    2712       33  _find_first_and_bit
  1.94%    0.00%    7.69%     0x0    5992       33  cpupri_set
  0.00%    0.00%    5.13%     0x0    3733       19  cpupri_set
 12.90%   12.82%    0.00%     0x8    3452      297  _find_first_and_bit
  1.29%    2.56%    0.00%     0x8    3007      117  __bitmap_and
  0.00%    5.13%    0.00%     0x8    3041       20  _find_first_and_bit
  0.00%    2.56%    2.56%     0x8    2374       22  cpupri_set
  0.00%    0.00%    7.69%     0x8    4194       38  cpupri_set
  8.39%    2.56%    0.00%    0x10    3336      264  _find_first_and_bit
  3.23%    0.00%    0.00%    0x10    3023       46  _find_first_and_bit
  2.58%    0.00%    0.00%    0x10    3040      130  __bitmap_and
  1.29%    0.00%   12.82%    0x10    4075       34  cpupri_set
  0.00%    0.00%    2.56%    0x10    2197       19  cpupri_set
  0.00%    2.56%    7.69%    0x18    4085       27  cpupri_set
  0.00%    2.56%    0.00%    0x18    3128      220  _find_first_and_bit
  0.00%    0.00%    5.13%    0x18    3028       20  cpupri_set
  2.58%    2.56%    0.00%    0x20    3089      198  _find_first_and_bit
  1.29%    0.00%    5.13%    0x20    5114       29  cpupri_set
  0.65%    2.56%    0.00%    0x20    3224       96  __bitmap_and
  0.65%    0.00%    7.69%    0x20    4392       31  cpupri_set
  2.58%    0.00%    0.00%    0x28    3327      214  _find_first_and_bit
  0.65%    2.56%    5.13%    0x28    5252       31  cpupri_set
  0.65%    0.00%    7.69%    0x28    8755       25  cpupri_set
  0.65%    0.00%    0.00%    0x28    4414       14  _find_first_and_bit
  1.29%    2.56%    0.00%    0x30    3139      171  _find_first_and_bit
  0.65%    0.00%    7.69%    0x30    2185       18  cpupri_set
  0.65%    0.00%    0.00%    0x30    3404      108  __bitmap_and
  0.00%    0.00%    2.56%    0x30    5542       21  cpupri_set
  3.23%    5.13%    0.00%    0x38    3493      190  _find_first_and_bit
  3.23%    2.56%    0.00%    0x38    3171      108  __bitmap_and
  0.00%    2.56%    7.69%    0x38    3285       14  cpupri_set
  0.00%    0.00%    5.13%    0x38    4035       27  cpupri_set

Signed-off-by: Pan Deng <pan.deng@intel.com>
Reviewed-by: Tianyou Li
<tianyou.li@intel.com> Reviewed-by: Chen Yu <yu.c.chen@intel.com> --- kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++++---- kernel/sched/cpupri.h | 4 + 2 files changed, 186 insertions(+), 18 deletions(-) diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index 42c40cfdf836..306b6baff4cd 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -64,6 +64,143 @@ static int convert_prio(int prio) return cpupri; } +#ifdef CONFIG_CPUMASK_OFFSTACK +static inline int alloc_vec_masks(struct cpupri_vec *vec) +{ + int i; + + for (i = 0; i < nr_node_ids; i++) { + if (!zalloc_cpumask_var_node(&vec->masks[i], GFP_KERNEL, i)) + goto cleanup; + + // Clear masks of cur node, set others + bitmap_complement(cpumask_bits(vec->masks[i]), + cpumask_bits(cpumask_of_node(i)), small_cpumask_bits); + } + return 0; + +cleanup: + while (i--) + free_cpumask_var(vec->masks[i]); + return -ENOMEM; +} + +static inline void free_vec_masks(struct cpupri_vec *vec) +{ + for (int i = 0; i < nr_node_ids; i++) + free_cpumask_var(vec->masks[i]); +} + +static inline int setup_vec_mask_var_ts(struct cpupri *cp) +{ + int i; + + for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) { + struct cpupri_vec *vec = &cp->pri_to_cpu[i]; + + vec->masks = kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL); + if (!vec->masks) + goto cleanup; + } + return 0; + +cleanup: + /* Free any already allocated masks */ + while (i--) { + kfree(cp->pri_to_cpu[i].masks); + cp->pri_to_cpu[i].masks = NULL; + } + + return -ENOMEM; +} + +static inline void free_vec_mask_var_ts(struct cpupri *cp) +{ + for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) { + kfree(cp->pri_to_cpu[i].masks); + cp->pri_to_cpu[i].masks = NULL; + } +} + +static inline int +available_cpu_in_nodes(struct task_struct *p, struct cpupri_vec *vec) +{ + int cur_node = numa_node_id(); + + for (int i = 0; i < nr_node_ids; i++) { + int nid = (cur_node + i) % nr_node_ids; + + if (cpumask_first_and_and(&p->cpus_mask, vec->masks[nid], + 
cpumask_of_node(nid)) < nr_cpu_ids) + return 1; + } + + return 0; +} + +#define available_cpu_in_vec available_cpu_in_nodes + +#else /* !CONFIG_CPUMASK_OFFSTACK */ + +static inline int alloc_vec_masks(struct cpupri_vec *vec) +{ + if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL)) + return -ENOMEM; + + return 0; +} + +static inline void free_vec_masks(struct cpupri_vec *vec) +{ + free_cpumask_var(vec->mask); +} + +static inline int setup_vec_mask_var_ts(struct cpupri *cp) +{ + return 0; +} + +static inline void free_vec_mask_var_ts(struct cpupri *cp) +{ +} + +static inline int +available_cpu_in_vec(struct task_struct *p, struct cpupri_vec *vec) +{ + if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids) + return 0; + + return 1; +} +#endif + +static inline int alloc_all_masks(struct cpupri *cp) +{ + int i; + + for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) { + if (alloc_vec_masks(&cp->pri_to_cpu[i])) + goto cleanup; + } + + return 0; + +cleanup: + while (i--) + free_vec_masks(&cp->pri_to_cpu[i]); + + return -ENOMEM; +} + +static inline void setup_vec_counts(struct cpupri *cp) +{ + for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) { + struct cpupri_vec *vec = &cp->pri_to_cpu[i]; + + atomic_set(&vec->count, 0); + } +} + static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask, int idx) { @@ -96,11 +233,24 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, if (skip) return 0; - if (cpumask_any_and(&p->cpus_mask, vec->mask) >= nr_cpu_ids) + if (!available_cpu_in_vec(p, vec)) return 0; +#ifdef CONFIG_CPUMASK_OFFSTACK + struct cpumask *cpupri_mask = lowest_mask; + + // available && lowest_mask + if (lowest_mask) { + cpumask_copy(cpupri_mask, vec->masks[0]); + for (int nid = 1; nid < nr_node_ids; nid++) + cpumask_and(cpupri_mask, cpupri_mask, vec->masks[nid]); + } +#else + struct cpumask *cpupri_mask = vec->mask; +#endif + if (lowest_mask) { - cpumask_and(lowest_mask, &p->cpus_mask, vec->mask); + 
cpumask_and(lowest_mask, &p->cpus_mask, cpupri_mask); cpumask_and(lowest_mask, lowest_mask, cpu_active_mask); /* @@ -229,7 +379,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri) if (likely(newpri != CPUPRI_INVALID)) { struct cpupri_vec *vec = &cp->pri_to_cpu[newpri]; +#ifdef CONFIG_CPUMASK_OFFSTACK + cpumask_set_cpu(cpu, vec->masks[cpu_to_node(cpu)]); +#else cpumask_set_cpu(cpu, vec->mask); +#endif /* * When adding a new vector, we update the mask first, * do a write memory barrier, and then update the count, to @@ -263,7 +417,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri) */ atomic_dec(&(vec)->count); smp_mb__after_atomic(); +#ifdef CONFIG_CPUMASK_OFFSTACK + cpumask_clear_cpu(cpu, vec->masks[cpu_to_node(cpu)]); +#else cpumask_clear_cpu(cpu, vec->mask); +#endif } *currpri = newpri; @@ -279,26 +437,31 @@ int cpupri_init(struct cpupri *cp) { int i; - for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) { - struct cpupri_vec *vec = &cp->pri_to_cpu[i]; - - atomic_set(&vec->count, 0); - if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL)) - goto cleanup; - } - + /* Allocate the cpu_to_pri array */ cp->cpu_to_pri = kcalloc(nr_cpu_ids, sizeof(int), GFP_KERNEL); if (!cp->cpu_to_pri) - goto cleanup; + return -ENOMEM; + /* Initialize all CPUs to invalid priority */ for_each_possible_cpu(i) cp->cpu_to_pri[i] = CPUPRI_INVALID; + /* Setup priority vectors */ + setup_vec_counts(cp); + if (setup_vec_mask_var_ts(cp)) + goto fail_setup_vectors; + + /* Allocate masks for each priority vector */ + if (alloc_all_masks(cp)) + goto fail_alloc_masks; + return 0; -cleanup: - for (i--; i >= 0; i--) - free_cpumask_var(cp->pri_to_cpu[i].mask); +fail_alloc_masks: + free_vec_mask_var_ts(cp); + +fail_setup_vectors: + kfree(cp->cpu_to_pri); return -ENOMEM; } @@ -308,9 +471,10 @@ int cpupri_init(struct cpupri *cp) */ void cpupri_cleanup(struct cpupri *cp) { - int i; - kfree(cp->cpu_to_pri); - for (i = 0; i < CPUPRI_NR_PRIORITIES; i++) - free_cpumask_var(cp->pri_to_cpu[i].mask); + + 
for (int i = 0; i < CPUPRI_NR_PRIORITIES; i++) + free_vec_masks(&cp->pri_to_cpu[i]); + + free_vec_mask_var_ts(cp); } diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h index 245b0fa626be..c53f1f4dad86 100644 --- a/kernel/sched/cpupri.h +++ b/kernel/sched/cpupri.h @@ -9,7 +9,11 @@ struct cpupri_vec { atomic_t count; +#ifdef CONFIG_CPUMASK_OFFSTACK + cpumask_var_t *masks ____cacheline_aligned; +#else cpumask_var_t mask ____cacheline_aligned; +#endif }; struct cpupri { -- 2.43.5 ^ permalink raw reply related [flat|nested] 41+ messages in thread
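[Editor's note: the access pattern this patch optimizes can be shown with a user-space toy model — per-node masks so `cpupri_set()` only dirties its own node's cache line, and a finder that scans from the local node outward. Word-sized masks, 4 nodes x 16 CPUs, and the names are illustrative assumptions; the kernel patch uses `cpumask_var_t` per node and atomic bitops.]

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES	4
#define CPUS_PER_NODE	16

struct node_mask {
	_Alignas(64) uint64_t bits;	/* one cache line per node */
};

struct vec_masks {
	struct node_mask node[NR_NODES];
};

static int cpu_to_node(int cpu) { return cpu / CPUS_PER_NODE; }

/* cpupri_set() analogue: the writer only dirties its node's line. */
static void vec_set_cpu(struct vec_masks *v, int cpu)
{
	v->node[cpu_to_node(cpu)].bits |= 1ULL << (cpu % CPUS_PER_NODE);
}

static void vec_clear_cpu(struct vec_masks *v, int cpu)
{
	v->node[cpu_to_node(cpu)].bits &= ~(1ULL << (cpu % CPUS_PER_NODE));
}

/*
 * cpupri_find() analogue: scan nodes starting from the caller's node,
 * so a hit on the local node never touches remote lines.  'allowed'
 * plays the role of p->cpus_mask.  Returns a CPU number or -1.
 */
static int vec_find_cpu(struct vec_masks *v, uint64_t allowed, int cur_node)
{
	for (int i = 0; i < NR_NODES; i++) {
		int nid = (cur_node + i) % NR_NODES;
		uint64_t node_allowed = (allowed >> (nid * CPUS_PER_NODE)) &
					((1ULL << CPUS_PER_NODE) - 1);
		uint64_t hit = v->node[nid].bits & node_allowed;

		if (hit)
			return nid * CPUS_PER_NODE + __builtin_ctzll(hit);
	}
	return -1;
}
```

The cost the patch accepts, visible above, is that a full search (or building `lowest_mask`) must now walk all node masks instead of one bitmap.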
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2025-07-21  6:10 ` [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
@ 2026-03-20 12:40   ` Peter Zijlstra
  2026-03-23 18:45     ` Tim Chen
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20 12:40 UTC (permalink / raw)
  To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Mon, Jul 21, 2025 at 02:10:26PM +0800, Pan Deng wrote:

> This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> mitigate false sharing.

So I really do think we need something here. We're running into the
whole cpumask contention thing on a semi regular basis.

But somehow I doubt this is it.

I would suggest building a radix tree like structure based on ACPIID
-- which is inherently suitable for this given that is exactly how
CPUID-0b/1f are specified.

This of course makes it very much x86 specific, but perhaps other
architectures can provide similarly structured id spaces suitable for
this.

If you make it so that it reduces to a single large level (equivalent to
the normal bitmaps) when no intermediate masks are specified, it should
work for all, and then architectures can opt-in by providing a suitable
id space and masks.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-03-20 12:40   ` Peter Zijlstra
@ 2026-03-23 18:45     ` Tim Chen
  2026-03-24 12:00       ` Peter Zijlstra
  0 siblings, 1 reply; 41+ messages in thread
From: Tim Chen @ 2026-03-23 18:45 UTC (permalink / raw)
  To: Peter Zijlstra, Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, yu.c.chen

On Fri, 2026-03-20 at 13:40 +0100, Peter Zijlstra wrote:
> On Mon, Jul 21, 2025 at 02:10:26PM +0800, Pan Deng wrote:
>
> > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> > mitigate false sharing.
>
> So I really do think we need something here. We're running into the
> whole cpumask contention thing on a semi regular basis.
>
> But somehow I doubt this is it.
>
> I would suggest building a radix tree like structure based on ACPIID
> -- which is inherently suitable for this given that is exactly how
> CPUID-0b/1f are specified.

Are you thinking about replacing the cpumask in cpupri_vec with
something like an xarray?

And a question on using ACPIID as the index for the CPU instead of
CPUID: is it because you want to even out access in the tree?

Tim

> This of course makes it very much x86 specific, but perhaps other
> architectures can provide similarly structured id spaces suitable for
> this.
>
> If you make it so that it reduces to a single large level (equivalent to
> the normal bitmaps) when no intermediate masks are specified, it should
> work for all, and then architectures can opt-in by providing a suitable
> id space and masks.

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-03-23 18:45 ` Tim Chen @ 2026-03-24 12:00 ` Peter Zijlstra 2026-03-31 5:37 ` Chen, Yu C 0 siblings, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2026-03-24 12:00 UTC (permalink / raw) To: Tim Chen Cc: Pan Deng, mingo, linux-kernel, tianyou.li, yu.c.chen, kprateek.nayak On Mon, Mar 23, 2026 at 11:45:01AM -0700, Tim Chen wrote: > On Fri, 2026-03-20 at 13:40 +0100, Peter Zijlstra wrote: > > On Mon, Jul 21, 2025 at 02:10:26PM +0800, Pan Deng wrote: > > > > > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to > > > mitigate false sharing. > > > > So I really do think we need something here. We're running into the > > whole cpumask contention thing on a semi regular basis. > > > > But somehow I doubt this is it. > > > > I would suggest building a radix tree like structure based on ACPIID > > -- which is inherently suitable for this given that is exactly how > > CPUID-0b/1f are specified. > > > > Are you thinking about replacing cpumask in cpupri_vec with something like xarray? > And a question on using ACPIID for the CPU as index instead of CPUID. > Is it because you want to even out access in the tree? Sorry, s/ACPI/APIC/, I keep cursing the sadist that put those two acronyms together. No, because I want it sorted by topology. Per virtue of CPUID-b/1f the APIC-ID is in topology order. Perhaps a little something like this. It will obviously only build on x86, and then only boot for those that have <=64 CPUs in their DIE domain. This needs ARCH_HAS_SBM and a cpumask based fallback implementation of sbm at the very least. Now, I was hoping AMD EPYC would have their CCD things as a topology level, but going by the MADT/CPUID dump I got from Boris, this is not the case. So we need to manually insert that level and hope the APIC-ID range is nicely setup for that, or they need to do worse things still. 
I think that for things like DMR DIE DTRT, but I've not yet seem one upclose :/ Also, I really wish all the SNC capable chips would have the SNC domains enumerated, even if SNC is not in use. This x86 topology enumeration stuff is such a shit show :-( Anyway, random hackery below, it basically does a 2 level structure where the leaf is a whole cacheline (double check the kzalloc_obj() stuff respects alignment) and we make sure the CPUs for that leaf are actually from the same cache domain. At least, that's the theory, see above ranting on the glories of topology enumeration. It boots in qemu with --cpus 16,sockets=2,dies=2 and appears to 'work'. YMMV This code is very much a PoC, treat it as such. --- diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 9cd493d467d4..24012a91ac1e 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -54,6 +54,7 @@ static inline void x86_32_probe_apic(void) { } #endif extern u32 cpuid_to_apicid[]; +extern u32 apicid_to_cpuid[]; #define CPU_ACPIID_INVALID U32_MAX diff --git a/arch/x86/include/asm/sbm.h b/arch/x86/include/asm/sbm.h new file mode 100644 index 000000000000..9a4d283347d1 --- /dev/null +++ b/arch/x86/include/asm/sbm.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include <asm/apic.h> + +static __always_inline u32 arch_sbm_cpu_to_idx(unsigned int cpu) +{ + return cpuid_to_apicid[cpu]; +} + +static __always_inline u32 arch_sbm_idx_to_cpu(unsigned int idx) +{ + return apicid_to_cpuid[idx]; +} diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c index eafcb1fc185a..6f3d18288600 100644 --- a/arch/x86/kernel/cpu/topology.c +++ b/arch/x86/kernel/cpu/topology.c @@ -48,6 +48,12 @@ DECLARE_BITMAP(phys_cpu_present_map, MAX_LOCAL_APIC) __read_mostly; /* Used for CPU number allocation and parallel CPU bringup */ u32 cpuid_to_apicid[] __ro_after_init = { [0 ... 
NR_CPUS - 1] = BAD_APICID, }; +u32 apicid_to_cpuid[MAX_LOCAL_APIC] = { 0 }; + +u32 arch_sbm_leafs __ro_after_init; +u32 arch_sbm_shift __ro_after_init; +u32 arch_sbm_mask __ro_after_init; +u32 arch_sbm_bits __ro_after_init; /* Bitmaps to mark registered APICs at each topology domain */ static struct { DECLARE_BITMAP(map, MAX_LOCAL_APIC); } apic_maps[TOPO_MAX_DOMAIN] __ro_after_init; @@ -234,6 +240,7 @@ static __init void topo_register_apic(u32 apic_id, u32 acpi_id, bool present) cpu = topo_get_cpunr(apic_id); cpuid_to_apicid[cpu] = apic_id; + apicid_to_cpuid[apic_id] = cpu; topo_set_cpuids(cpu, apic_id, acpi_id); } else { topo_info.nr_disabled_cpus++; @@ -537,7 +544,9 @@ void __init topology_init_possible_cpus(void) MAX_LOCAL_APIC, apicid); if (apicid >= MAX_LOCAL_APIC) break; - cpuid_to_apicid[topo_info.nr_assigned_cpus++] = apicid; + cpu = topo_info.nr_assigned_cpus++; + cpuid_to_apicid[cpu] = apicid; + apicid_to_cpuid[apicid] = cpu; } for (cpu = 0; cpu < allowed; cpu++) { @@ -551,6 +560,17 @@ void __init topology_init_possible_cpus(void) cpu_mark_primary_thread(cpu, apicid); set_cpu_present(cpu, test_bit(apicid, phys_cpu_present_map)); } + + apicid = 0; + for_each_possible_cpu(cpu) + apicid = max(apicid, cpuid_to_apicid[cpu]); + + arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1; + arch_sbm_leafs = 1 + (apicid >> arch_sbm_shift); + arch_sbm_mask = (1 << arch_sbm_shift) - 1; + arch_sbm_bits = arch_sbm_shift; + + pr_info("SBM: shift(%d) leafs(%d) APIC(%x)\n", arch_sbm_shift, arch_sbm_leafs, apicid); } /* diff --git a/include/linux/sbm.h b/include/linux/sbm.h new file mode 100644 index 000000000000..8beade6c0585 --- /dev/null +++ b/include/linux/sbm.h @@ -0,0 +1,83 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_SBM_H +#define _LINUX_SBM_H + +#include <linux/slab.h> +#include <linux/bitmap.h> +#include <linux/cpumask.h> +#include <asm/sbm.h> + +extern unsigned int arch_sbm_leafs; +extern unsigned int arch_sbm_shift; +extern unsigned int 
arch_sbm_mask; +extern unsigned int arch_sbm_bits; + +extern unsigned int arch_sbm_cpu_to_idx(unsigned int cpu); +extern unsigned int arch_sbm_idx_to_cpu(unsigned int idx); + +enum sbm_type { + st_root = 0, + st_leaf, +}; + +struct sbm_root { + enum sbm_type type; + unsigned int nr; + struct sbm_leaf *leafs[] __counted_by(nr); +}; + +struct sbm_leaf { + enum sbm_type type; + unsigned long bitmap; +} ____cacheline_aligned; + +struct sbm { + enum sbm_type type; +}; + +extern struct sbm *sbm_alloc(void); +extern unsigned int sbm_find_next_bit(struct sbm *sbm, int start); + +#define __sbm_op(sbm, func) \ +({ \ + struct sbm_leaf *leaf = (void *)sbm; \ + int idx = arch_sbm_cpu_to_idx(cpu); \ + if (sbm->type == st_root) { \ + struct sbm_root *root = (void *)sbm; \ + int nr = idx >> arch_sbm_shift; \ + leaf = root->leafs[nr]; \ + } \ + int bit = idx & arch_sbm_mask; \ + func(bit, &leaf->bitmap); \ +}) + +static inline void sbm_cpu_set(struct sbm *sbm, int cpu) +{ + __sbm_op(sbm, set_bit); +} + +static inline void sbm_cpu_clear(struct sbm *sbm, int cpu) +{ + __sbm_op(sbm, clear_bit); +} + +static inline void __sbm_cpu_set(struct sbm *sbm, int cpu) +{ + __sbm_op(sbm, __set_bit); +} + +static inline void __sbm_cpu_clear(struct sbm *sbm, int cpu) +{ + __sbm_op(sbm, __clear_bit); +} + +static inline bool sbm_cpu_test(struct sbm *sbm, int cpu) +{ + return __sbm_op(sbm, test_bit); +} + +#define sbm_for_each_set_bit(sbm, idx) \ + for (int idx = sbm_find_next_bit(sbm, 0); \ + idx >= 0; idx = sbm_find_next_bit(sbm, idx+1)) + +#endif /* _LINUX_SBM_H */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 226509231e67..a3a423c4706e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -49,6 +49,7 @@ #include <linux/ratelimit.h> #include <linux/task_work.h> #include <linux/rbtree_augmented.h> +#include <linux/sbm.h> #include <asm/switch_to.h> @@ -7384,7 +7385,7 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask); #ifdef CONFIG_NO_HZ_COMMON static struct { 
- cpumask_var_t idle_cpus_mask; + struct sbm *sbm; int has_blocked_load; /* Idle CPUS has blocked load */ int needs_update; /* Newly idle CPUs need their next_balance collated */ unsigned long next_balance; /* in jiffy units */ @@ -12615,12 +12616,11 @@ static inline int on_null_domain(struct rq *rq) static inline int find_new_ilb(void) { int this_cpu = smp_processor_id(); - const struct cpumask *hk_mask; int ilb_cpu; - hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE); + sbm_for_each_set_bit(nohz.sbm, idx) { + ilb_cpu = arch_sbm_idx_to_cpu(idx); - for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) { if (ilb_cpu == this_cpu) continue; @@ -12685,7 +12685,7 @@ static void nohz_balancer_kick(struct rq *rq) unsigned long now = jiffies; struct sched_domain_shared *sds; struct sched_domain *sd; - int nr_busy, i, cpu = rq->cpu; + int nr_busy, cpu = rq->cpu; unsigned int flags = 0; if (unlikely(rq->idle_balance)) @@ -12713,13 +12713,6 @@ static void nohz_balancer_kick(struct rq *rq) if (time_before(now, nohz.next_balance)) goto out; - /* - * None are in tickless mode and hence no need for NOHZ idle load - * balancing - */ - if (unlikely(cpumask_empty(nohz.idle_cpus_mask))) - return; - if (rq->nr_running >= 2) { flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK; goto out; @@ -12739,24 +12732,6 @@ static void nohz_balancer_kick(struct rq *rq) } } - sd = rcu_dereference_all(per_cpu(sd_asym_packing, cpu)); - if (sd) { - /* - * When ASYM_PACKING; see if there's a more preferred CPU - * currently idle; in which case, kick the ILB to move tasks - * around. - * - * When balancing between cores, all the SMT siblings of the - * preferred CPU must be idle. 
- */ - for_each_cpu_and(i, sched_domain_span(sd), nohz.idle_cpus_mask) { - if (sched_asym(sd, i, cpu)) { - flags = NOHZ_STATS_KICK | NOHZ_BALANCE_KICK; - goto unlock; - } - } - } - sd = rcu_dereference_all(per_cpu(sd_asym_cpucapacity, cpu)); if (sd) { /* @@ -12829,7 +12804,8 @@ void nohz_balance_exit_idle(struct rq *rq) return; rq->nohz_tick_stopped = 0; - cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask); + if (cpumask_test_cpu(rq->cpu, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE))) + sbm_cpu_clear(nohz.sbm, rq->cpu); set_cpu_sd_state_busy(rq->cpu); } @@ -12886,7 +12862,8 @@ void nohz_balance_enter_idle(int cpu) rq->nohz_tick_stopped = 1; - cpumask_set_cpu(cpu, nohz.idle_cpus_mask); + if (cpumask_test_cpu(rq->cpu, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE))) + sbm_cpu_set(nohz.sbm, rq->cpu); /* * Ensures that if nohz_idle_balance() fails to observe our @@ -12913,7 +12890,7 @@ static bool update_nohz_stats(struct rq *rq) if (!rq->has_blocked_load) return false; - if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask)) + if (!sbm_cpu_test(nohz.sbm, cpu)) return false; if (!time_after(jiffies, READ_ONCE(rq->last_blocked_load_update_tick))) @@ -12967,7 +12944,9 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags) * Start with the next CPU after this_cpu so we will end with this_cpu and let a * chance for other idle cpu to pull load. 
*/ - for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1) { + sbm_for_each_set_bit(nohz.sbm, idx) { + balance_cpu = arch_sbm_idx_to_cpu(idx); + if (!idle_cpu(balance_cpu)) continue; @@ -14250,6 +14229,6 @@ __init void init_sched_fair_class(void) #ifdef CONFIG_NO_HZ_COMMON nohz.next_balance = jiffies; nohz.next_blocked = jiffies; - zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); + nohz.sbm = sbm_alloc(); #endif } diff --git a/lib/Makefile b/lib/Makefile index 1b9ee167517f..8d1f6b5327d5 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -40,7 +40,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \ is_single_threaded.o plist.o decompress.o kobject_uevent.o \ earlycpio.o seq_buf.o siphash.o dec_and_lock.o \ nmi_backtrace.o win_minmax.o memcat_p.o \ - buildid.o objpool.o iomem_copy.o sys_info.o + buildid.o objpool.o iomem_copy.o sys_info.o sbm.o lib-$(CONFIG_UNION_FIND) += union_find.o lib-$(CONFIG_PRINTK) += dump_stack.o diff --git a/lib/sbm.c b/lib/sbm.c new file mode 100644 index 000000000000..167cf857cd32 --- /dev/null +++ b/lib/sbm.c @@ -0,0 +1,57 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include <linux/sbm.h> + +struct sbm *sbm_alloc(void) +{ + unsigned int nr = arch_sbm_leafs; + struct sbm_root *root = kzalloc_flex(*root, leafs, nr); + struct sbm_leaf *leaf; + if (!root) + return NULL; + + root->type = st_root; + + for (int i = 0; i < nr; i++) { + leaf = kzalloc_obj(*leaf); + if (!leaf) + goto fail; + leaf->type = st_leaf; + root->leafs[i] = leaf; + } + + if (nr == 1) { + leaf = root->leafs[0]; + kfree(root); + return (void *)leaf; + } + + return (void *)root; + +fail: + for (int i = 0; i < nr; i++) + kfree(root->leafs[i]); + kfree(root); + return NULL; +} + +unsigned int sbm_find_next_bit(struct sbm *sbm, int start) +{ + struct sbm_leaf *leaf = (void *)sbm; + struct sbm_root *root = (void *)sbm; + int nr = start >> arch_sbm_shift; + int bit = start & arch_sbm_mask; + unsigned long tmp, mask = (~0UL) << bit; + if (sbm->type == st_root) { + for 
(; nr < arch_sbm_leafs; nr++, mask = ~0UL) { + leaf = root->leafs[nr]; + tmp = leaf->bitmap & mask; + if (!tmp) + continue; + } + } else { + tmp = leaf->bitmap & mask; + } + if (!tmp) + return -1; + return (nr << arch_sbm_shift) | __ffs(tmp); +} ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-03-24 12:00 ` Peter Zijlstra @ 2026-03-31 5:37 ` Chen, Yu C 2026-03-31 10:19 ` K Prateek Nayak 0 siblings, 1 reply; 41+ messages in thread From: Chen, Yu C @ 2026-03-31 5:37 UTC (permalink / raw) To: Peter Zijlstra, Tim Chen Cc: Pan Deng, mingo, linux-kernel, tianyou.li, kprateek.nayak On 3/24/2026 8:00 PM, Peter Zijlstra wrote: > On Mon, Mar 23, 2026 at 11:45:01AM -0700, Tim Chen wrote: >> On Fri, 2026-03-20 at 13:40 +0100, Peter Zijlstra wrote: >>> On Mon, Jul 21, 2025 at 02:10:26PM +0800, Pan Deng wrote: >>> >>>> This change splits `cpupri_vec->cpumask` into per-NUMA-node data to >>>> mitigate false sharing. >>> >>> So I really do think we need something here. We're running into the >>> whole cpumask contention thing on a semi regular basis. >>> [ ... ] > + > +unsigned int sbm_find_next_bit(struct sbm *sbm, int start) > +{ > + struct sbm_leaf *leaf = (void *)sbm; > + struct sbm_root *root = (void *)sbm; > + int nr = start >> arch_sbm_shift; > + int bit = start & arch_sbm_mask; > + unsigned long tmp, mask = (~0UL) << bit; > + if (sbm->type == st_root) { > + for (; nr < arch_sbm_leafs; nr++, mask = ~0UL) { > + leaf = root->leafs[nr]; > + tmp = leaf->bitmap & mask; > + if (!tmp) > + continue; I suppose this should be if (tmp) break; otherwise [ 40.071616] watchdog: BUG: soft lockup - CPU#0 stuck for 30s! 
[swapper/0:0] [ 40.071616] Modules linked in: [ 40.071616] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 7.0.0-rc5-sbm-+ #16 PREEMPT(full) [ 40.071616] RIP: 0010:sbm_find_next_bit+0x2a/0xa0 > + } > + } else { > + tmp = leaf->bitmap & mask; > + } > + if (!tmp) > + return -1; > + return (nr << arch_sbm_shift) | __ffs(tmp); > +} update of the test: With above change, I did a simple hackbench test on a system with multiple LLCs within 1 node, so the benefit is significant(+12%~+30%) when system is under-loaded, while some regression when overloaded(-10%)(need to figure out) thanks, Chenyu ^ permalink raw reply [flat|nested] 41+ messages in thread
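[Editor's note: for reference, the fixed-up lookup can be modeled in plain userspace C. The 2-leaf, shift-6 layout and the `model_` names are assumptions for illustration only, not the kernel code.]

```c
#include <assert.h>

/* Userspace model of sbm_find_next_bit() with the `if (tmp) break;`
 * fix applied; a 2-leaf, shift-6 layout is assumed for illustration. */
#define MODEL_SHIFT 6
#define MODEL_MASK  ((1 << MODEL_SHIFT) - 1)
#define MODEL_LEAFS 2

static unsigned long model_leafs[MODEL_LEAFS];

static int model_find_next_bit(int start)
{
	int nr = start >> MODEL_SHIFT;
	int bit = start & MODEL_MASK;
	unsigned long tmp = 0, mask = (~0UL) << bit;

	for (; nr < MODEL_LEAFS; nr++, mask = ~0UL) {
		tmp = model_leafs[nr] & mask;
		if (tmp)
			break;	/* the fix: stop at the first leaf with a match */
	}
	if (!tmp)
		return -1;
	return (nr << MODEL_SHIFT) | __builtin_ctzl(tmp);
}
```

With the original `if (!tmp) continue;` the loop never stops on a hit, so `nr` runs past the last leaf and the returned index is computed from the wrong leaf, which fits the soft lockup above.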
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-03-31 5:37 ` Chen, Yu C @ 2026-03-31 10:19 ` K Prateek Nayak 2026-04-02 3:15 ` Chen, Yu C 0 siblings, 1 reply; 41+ messages in thread From: K Prateek Nayak @ 2026-03-31 10:19 UTC (permalink / raw) To: Chen, Yu C, Peter Zijlstra, Tim Chen Cc: Pan Deng, mingo, linux-kernel, tianyou.li Hello Chenyu, On 3/31/2026 11:07 AM, Chen, Yu C wrote: > update of the test: > With above change, I did a simple hackbench test on > a system with multiple LLCs within 1 node, so the benefit > is significant(+12%~+30%) when system is under-loaded, while > some regression when overloaded(-10%)(need to figure out) Could it be because of how we are traversing the CPUs now for idle load balancing? Since we use the first set bit for ilb_cpu and also starting balancing from that very CPU, we might just stop after a successful balance on the ilb_cpu. Would something like below on top of Peter's suggestion + your fix help? (lightly tested; has survived sched messaging on baremetal) diff --git a/include/linux/sbm.h b/include/linux/sbm.h index 8beade6c0585..98c4c1866534 100644 --- a/include/linux/sbm.h +++ b/include/linux/sbm.h @@ -76,8 +76,45 @@ static inline bool sbm_cpu_test(struct sbm *sbm, int cpu) return __sbm_op(sbm, test_bit); } +static __always_inline +unsigned int sbm_find_next_bit_wrap(struct sbm *sbm, int start) +{ + int bit = sbm_find_next_bit(sbm, start); + + if (bit >= 0 || start == 0) + return bit; + + bit = sbm_find_next_bit(sbm, 0); + return bit < start ? bit : -1; +} + +static __always_inline +unsigned int __sbm_for_each_wrap(struct sbm *sbm, int start, int n) +{ + int bit; + + /* If not wrapped around */ + if (n > start) { + /* and have a bit, just return it. 
*/ + bit = sbm_find_next_bit(sbm, n); + if (bit >= 0) + return bit; + + /* Otherwise, wrap around and ... */ + n = 0; + } + + /* Search the other part. */ + bit = sbm_find_next_bit(sbm, n); + return bit < start ? 
bit : -1; +} + #define sbm_for_each_set_bit(sbm, idx) \ for (int idx = sbm_find_next_bit(sbm, 0); \ idx >= 0; idx = sbm_find_next_bit(sbm, idx+1)) +#define sbm_for_each_set_bit_wrap(sbm, idx, start) \ + for (int idx = sbm_find_next_bit_wrap(sbm, start); \ + idx >= 0; idx = __sbm_for_each_wrap(sbm, start, idx+1)) + #endif /* _LINUX_SBM_H */ diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a3a423c4706e..f485afb6286d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -12916,6 +12916,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags) int this_cpu = this_rq->cpu; int balance_cpu; struct rq *rq; + u32 start; WARN_ON_ONCE((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK); @@ -12944,7 +12945,8 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags) * Start with the next CPU after this_cpu so we will end with this_cpu and let a * chance for other idle cpu to pull load. */ - sbm_for_each_set_bit(nohz.sbm, idx) { + start = arch_sbm_cpu_to_idx((this_cpu + 1) % nr_cpu_ids); + sbm_for_each_set_bit_wrap(nohz.sbm, idx, start) { balance_cpu = arch_sbm_idx_to_cpu(idx); if (!idle_cpu(balance_cpu)) --- This is pretty much giving me similar performance as tip for sched messaging runs under heavy load but your mileage may vary :-) -- Thanks and Regards, Prateek ^ permalink raw reply related [flat|nested] 41+ messages in thread
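[Editor's note: the wrap-around scan can likewise be modeled in userspace over a single 64-bit word. `find_next()` stands in for `sbm_find_next_bit()`; this is a sketch of the iteration order only, not the kernel code.]

```c
#include <assert.h>

/* Model of the wrap-around iteration: a single u64 stands in for the
 * whole sbm, find_next() for sbm_find_next_bit(). */
static unsigned long model_word;

static int find_next(int start)
{
	unsigned long tmp;

	if (start >= 64)
		return -1;
	tmp = model_word & ((~0UL) << start);
	return tmp ? __builtin_ctzl(tmp) : -1;
}

static int find_next_wrap(int start)
{
	int bit = find_next(start);

	if (bit >= 0 || start == 0)
		return bit;

	bit = find_next(0);
	return bit < start ? bit : -1;
}

static int for_each_wrap_next(int start, int n)
{
	int bit;

	/* If not wrapped around */
	if (n > start) {
		/* and have a bit, just return it. */
		bit = find_next(n);
		if (bit >= 0)
			return bit;
		/* Otherwise, wrap around and ... */
		n = 0;
	}
	/* Search the other part. */
	bit = find_next(n);
	return bit < start ? bit : -1;
}

/* Collect the visit order starting at a given index, as the proposed
 * sbm_for_each_set_bit_wrap() would. */
static int collect_wrap(int start, int *out)
{
	int n = 0;

	for (int idx = find_next_wrap(start); idx >= 0;
	     idx = for_each_wrap_next(start, idx + 1))
		out[n++] = idx;
	return n;
}
```

The point of the wrap is visible in the visit order: starting after bit 10, the scan covers the high bits first and then comes back around for the low ones, so every set bit is visited exactly once.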
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-03-31 10:19 ` K Prateek Nayak @ 2026-04-02 3:15 ` Chen, Yu C 2026-04-02 4:41 ` K Prateek Nayak 0 siblings, 1 reply; 41+ messages in thread From: Chen, Yu C @ 2026-04-02 3:15 UTC (permalink / raw) To: K Prateek Nayak, Peter Zijlstra, Tim Chen Cc: Pan Deng, mingo, linux-kernel, tianyou.li Hello Prateek, On 3/31/2026 6:19 PM, K Prateek Nayak wrote: > Hello Chenyu, > > On 3/31/2026 11:07 AM, Chen, Yu C wrote: >> update of the test: >> With above change, I did a simple hackbench test on >> a system with multiple LLCs within 1 node, so the benefit >> is significant(+12%~+30%) when system is under-loaded, while >> some regression when overloaded(-10%)(need to figure out) > > Could it be because of how we are traversing the CPUs now for idle load > balancing? Since we use the first set bit for ilb_cpu and also starting > balancing from that very CPU, we might just stop after a successful > balance on the ilb_cpu. > > Would something like below on top of Peter's suggestion + your fix help? > > (lightly tested; has survived sched messaging on baremetal) > > diff --git a/include/linux/sbm.h b/include/linux/sbm.h > index 8beade6c0585..98c4c1866534 100644 > --- a/include/linux/sbm.h > +++ b/include/linux/sbm.h > @@ -76,8 +76,45 @@ static inline bool sbm_cpu_test(struct sbm *sbm, int cpu) > return __sbm_op(sbm, test_bit); > } > > +static __always_inline > +unsigned int sbm_find_next_bit_wrap(struct sbm *sbm, int start) > +{ > + int bit = sbm_find_next_bit(sbm, start); > + > + if (bit >= 0 || start == 0) > + return bit; > + > + bit = sbm_find_next_bit(sbm, 0); > + return bit < start ? bit : -1; > +} > + > +static __always_inline > +unsigned int __sbm_for_each_wrap(struct sbm *sbm, int start, int n) > +{ > + int bit; > + > + /* If not wrapped around */ > + if (n > start) { > + /* and have a bit, just return it. 
*/ > + bit = sbm_find_next_bit(sbm, n); > + if (bit >= 0) > + return bit; > + > + /* Otherwise, wrap around and ... */ > + n = 0; > + } > + > + /* Search the other part. */ > + bit = sbm_find_next_bit(sbm, n); > + return bit < start ? bit : -1; > +} > + > #define sbm_for_each_set_bit(sbm, idx) \ > for (int idx = sbm_find_next_bit(sbm, 0); \ > idx >= 0; idx = sbm_find_next_bit(sbm, idx+1)) > > +#define sbm_for_each_set_bit_wrap(sbm, idx, start) \ > + for (int idx = sbm_find_next_bit_wrap(sbm, start); \ > + idx >= 0; idx = __sbm_for_each_wrap(sbm, start, idx+1)) > + > #endif /* _LINUX_SBM_H */ > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index a3a423c4706e..f485afb6286d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -12916,6 +12916,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags) > int this_cpu = this_rq->cpu; > int balance_cpu; > struct rq *rq; > + u32 start; > > WARN_ON_ONCE((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK); > > @@ -12944,7 +12945,8 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags) > * Start with the next CPU after this_cpu so we will end with this_cpu and let a > * chance for other idle cpu to pull load. > */ > - sbm_for_each_set_bit(nohz.sbm, idx) { > + start = arch_sbm_cpu_to_idx((this_cpu + 1) % nr_cpu_ids); > + sbm_for_each_set_bit_wrap(nohz.sbm, idx, start) { > balance_cpu = arch_sbm_idx_to_cpu(idx); > > if (!idle_cpu(balance_cpu)) > --- > > This is pretty much giving me similar performance as tip for sched > messaging runs under heavy load but your mileage may vary :-) > Thanks very much for providing this optimization. It should help more nohz idle CPUs-beyond just the currently selected ilb_cpu to assist in offloading work. When I applied this patch and reran the test, it appeared to introduce some regressions (underload and overload) compared to the baseline without Peter’s sbm applied. 
One suspicion is that with sbm enabled (without your patch), more tasks are "aggregated" onto the first CPU (or maybe the front part) in nohz.sbm, because sbm_for_each_set_bit() always picks the first idle CPU to pull work. As we already know, hackbench on our platform strongly prefers being aggregated rather than being spread across different LLCs. So with the spreading fix, hackbench might be put on different CPUs. Anyway, I'll run more rounds of testing to check whether this is consistent or merely due to run-to-run variance. And I'll try other workloads besides hackbench. Or do you have a suggestion on what workload we can try that is sensitive to nohz cpumask access? (I chose hackbench because I found Shrikanth was using hackbench for nohz evaluation in commit 5d86d542f6.) thanks, Chenyu ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-04-02 3:15 ` Chen, Yu C @ 2026-04-02 4:41 ` K Prateek Nayak 2026-04-02 10:55 ` Peter Zijlstra 0 siblings, 1 reply; 41+ messages in thread From: K Prateek Nayak @ 2026-04-02 4:41 UTC (permalink / raw) To: Chen, Yu C, Peter Zijlstra, Tim Chen Cc: Pan Deng, mingo, linux-kernel, tianyou.li Hello Chenyu, Thank you for testing the changes! Much appreciated. On 4/2/2026 8:45 AM, Chen, Yu C wrote: > One suspicion is that with sbm enabled (without your patch), more > tasks are "aggregated" onto the first CPU (or maybe the front part) > in nohz.sbm, because sbm_for_each_set_bit() always picks the first > idle CPU to pull work. As we already know, hackbench on our > platform strongly prefers being aggregated rather than being > spread across different LLCs. So with the spreading fix, > hackbench might be put on different CPUs. Ack! But I cannot seem to come up with a theory on why it would be any worse than the original. P.S. what does your SBM log in the dmesg look like? On my 3rd Generation EPYC machine (2 x 64C/128T) it looks like: CPU topo: SBM: shift(6) leafs(4) APIC(ff) Now, I suppose I get 4 leaves because I have 128CPUs per socket (2 x u64 per socket) but it is not super clear how it is achieved from doing: arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1; that divides TOPO_DIE_DOMAIN into two but that should only be okay until 128CPUs per DIE. It is still not super clear to me how the logic deals with more than 128CPUs in a DIE domain because that'll need more than the u64 but sbm_find_next_bit() simply does: tmp = leaf->bitmap & mask; /* All are u64 */ expecting just the u64 bitmap to represent all the CPUs in the leaf. If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask as 7f (127) which allows a leaf to more than 64 CPUs but we are using the "u64 bitmap" directly and not: find_next_bit(bitmap, arch_sbm_mask) Am I missing something here? 
AMD got the 0x80000026 for defining TOPO_DIE_DOMAIN as soon as we crossed 256CPUs per socket in 4th Generation EPYC so it'll have per CCD (up to 2LLCs) sbm leaves but if I'm not mistaken, some of the SPR systems still advertised one large TILE / DIE domain. I'm curious if your test system exposed multiple DIE per PKG since 240 logical CPUs per socket based on the cover letter would still need more than 64 bits if it is advertised as a single DIE. > Anyway, I'll run more > rounds of testing to check whether this is consistent or merely > due to run-to-run variance. And I'll try other workloads besides > hackbench. Or do you have a suggestion on what workload we can try > that is sensitive to nohz cpumask access? (I chose hackbench because > I found Shrikanth was using hackbench for nohz evaluation in > commit 5d86d542f6.) Most sensitive is schbench's tail latency when system is fully loaded (#workers = #CPUs) but that data point also has large run-to-run variation - I generally look for crazy jumps like the tail latency turning 5-8x consistently across multiple runs before actually concluding it is a regression. hackbench (/ sched-messaging) should be good enough from a throughput standpoint. -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-04-02 4:41 ` K Prateek Nayak @ 2026-04-02 10:55 ` Peter Zijlstra 2026-04-02 11:06 ` K Prateek Nayak 0 siblings, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2026-04-02 10:55 UTC (permalink / raw) To: K Prateek Nayak Cc: Chen, Yu C, Tim Chen, Pan Deng, mingo, linux-kernel, tianyou.li On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote: > It is still not super clear to me how the logic deals with more than > 128CPUs in a DIE domain because that'll need more than the u64 but > sbm_find_next_bit() simply does: > > tmp = leaf->bitmap & mask; /* All are u64 */ > > expecting just the u64 bitmap to represent all the CPUs in the leaf. > > If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask > as 7f (127) which allows a leaf to more than 64 CPUs but we are > using the "u64 bitmap" directly and not: > > find_next_bit(bitmap, arch_sbm_mask) > > Am I missing something here? Nope. That logic just isn't there, that was left as an exercise to the reader :-) For AMD in particular it would be good to have one leaf per CCD, but since CCD are not enumerated in your topology (they really should be), I didn't do that. Now, I seem to remember we had this discussion in the past some time, and you had some hacks available. Anyway, the whole premise was to have one leaf/cacheline per cache, such that high frequency atomic ops set/clear bit, don't bounce the line around. I took the nohz bitmap, because it was relatively simple and is known to suffer from contention under certain workloads. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-04-02 10:55 ` Peter Zijlstra @ 2026-04-02 11:06 ` K Prateek Nayak 2026-04-03 5:46 ` Chen, Yu C 0 siblings, 1 reply; 41+ messages in thread From: K Prateek Nayak @ 2026-04-02 11:06 UTC (permalink / raw) To: Peter Zijlstra Cc: Chen, Yu C, Tim Chen, Pan Deng, mingo, linux-kernel, tianyou.li Hello Peter, On 4/2/2026 4:25 PM, Peter Zijlstra wrote: > On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote: > >> It is still not super clear to me how the logic deals with more than >> 128CPUs in a DIE domain because that'll need more than the u64 but >> sbm_find_next_bit() simply does: >> >> tmp = leaf->bitmap & mask; /* All are u64 */ >> >> expecting just the u64 bitmap to represent all the CPUs in the leaf. >> >> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask >> as 7f (127) which allows a leaf to more than 64 CPUs but we are >> using the "u64 bitmap" directly and not: >> >> find_next_bit(bitmap, arch_sbm_mask) >> >> Am I missing something here? > > Nope. That logic just isn't there, that was left as an exercise to the > reader :-) Ack! Let me go fiddle with that. > > For AMD in particular it would be good to have one leaf per CCD, but > since CCD are not enumerated in your topology (they really should be), I > didn't do that. We got the extended topology leaf 0x80000026 since 4th Generation EPYC and we (well Thomas) added the parser support in v6.10 [1] so we can discover the CCD boundary using that now ;-) https://lore.kernel.org/all/20240314050432.1710-1-kprateek.nayak@amd.com/ > > Now, I seem to remember we had this discussion in the past some time, > and you had some hacks available. That, I believe, was for the NPS boundaries that we don't expose in NPS1 but CCX should be good enough. > > Anyway, the whole premise was to have one leaf/cacheline per cache, such > that high frequency atomic ops set/clear bit, don't bounce the line > around. 
> > I took the nohz bitmap, because it was relatively simple and is known to > suffer from contention under certain workloads. Ack! It would be better to tie it to the TOPO_TILE_DOMAIN then which maps to the "CCX" on AMD and is the LLC boundary. CCD is just a cluster of CCX that is nearby - mostly the dense core offerings enumerate 2CCX per CCD. -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-04-02 11:06 ` K Prateek Nayak @ 2026-04-03 5:46 ` Chen, Yu C 2026-04-03 8:13 ` K Prateek Nayak 2026-04-07 20:35 ` Tim Chen 0 siblings, 2 replies; 41+ messages in thread From: Chen, Yu C @ 2026-04-03 5:46 UTC (permalink / raw) To: K Prateek Nayak, Peter Zijlstra Cc: Tim Chen, Pan Deng, mingo, linux-kernel, tianyou.li On 4/2/2026 7:06 PM, K Prateek Nayak wrote: > Hello Peter, > > On 4/2/2026 4:25 PM, Peter Zijlstra wrote: >> On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote: >> >>> It is still not super clear to me how the logic deals with more than >>> 128CPUs in a DIE domain because that'll need more than the u64 but >>> sbm_find_next_bit() simply does: >>> >>> tmp = leaf->bitmap & mask; /* All are u64 */ >>> >>> expecting just the u64 bitmap to represent all the CPUs in the leaf. >>> >>> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask >>> as 7f (127) which allows a leaf to more than 64 CPUs but we are >>> using the "u64 bitmap" directly and not: >>> >>> find_next_bit(bitmap, arch_sbm_mask) >>> >>> Am I missing something here? >> >> Nope. That logic just isn't there, that was left as an exercise to the >> reader :-) > > Ack! Let me go fiddle with that. > Nice catch. I hadn't noticed this since we have fewer than 64 CPUs per die. Please feel free to send patches to me when they're available. And regarding your other question about the calculation of arch_sbm_shift, I'm trying to understand why there is a subtraction of 1, should it be: - arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1; + arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1]; ? Are we trying to filer the raw global unique die id? - similar to topo_apicid() which mask the lower x86_topo_system.dom_shifts[dom - 1]). With above change I can get a correct value of leaves (4) rather than (2) in the original version. 
thanks, Chenyu ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-04-03 5:46 ` Chen, Yu C @ 2026-04-03 8:13 ` K Prateek Nayak 2026-04-07 20:35 ` Tim Chen 1 sibling, 0 replies; 41+ messages in thread From: K Prateek Nayak @ 2026-04-03 8:13 UTC (permalink / raw) To: Chen, Yu C, Peter Zijlstra Cc: Tim Chen, Pan Deng, mingo, linux-kernel, tianyou.li Hello Chenyu, On 4/3/2026 11:16 AM, Chen, Yu C wrote: > On 4/2/2026 7:06 PM, K Prateek Nayak wrote: >> Hello Peter, >> >> On 4/2/2026 4:25 PM, Peter Zijlstra wrote: >>> On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote: >>> >>>> It is still not super clear to me how the logic deals with more than >>>> 128CPUs in a DIE domain because that'll need more than the u64 but >>>> sbm_find_next_bit() simply does: >>>> >>>> tmp = leaf->bitmap & mask; /* All are u64 */ >>>> >>>> expecting just the u64 bitmap to represent all the CPUs in the leaf. >>>> >>>> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask >>>> as 7f (127) which allows a leaf to more than 64 CPUs but we are >>>> using the "u64 bitmap" directly and not: >>>> >>>> find_next_bit(bitmap, arch_sbm_mask) >>>> >>>> Am I missing something here? >>> >>> Nope. That logic just isn't there, that was left as an exercise to the >>> reader :-) >> >> Ack! Let me go fiddle with that. >> > > Nice catch. I hadn't noticed this since we have fewer than > 64 CPUs per die. Please feel free to send patches to me when > they're available. > > And regarding your other question about the calculation of arch_sbm_shift, > I'm trying to understand why there is a subtraction of 1, should it be: > - arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1; > + arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1]; > ? > Are we trying to filer the raw global unique die id? - similar to topo_apicid() > which mask the lower x86_topo_system.dom_shifts[dom - 1]). 
> With above change I can get a correct value of leaves (4) rather than (2) in > the original version. Thanks for confirming. I guess that would just be TOPO_TILE_DOMAIN then and would work well on AMD too since that is where the CCX is mapped. I'll get hold of an SPR / use a VM to confirm the 0x1f behavior. I'll post the patches next week since I have to check with Andrea on how the ARM systems have decided to number their SMT threads and whether they require separate plumbing for arch_sbm_idx_to_cpu(), arch_sbm_cpu_to_idx() or not. -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-04-03 5:46 ` Chen, Yu C 2026-04-03 8:13 ` K Prateek Nayak @ 2026-04-07 20:35 ` Tim Chen 2026-04-08 3:06 ` K Prateek Nayak 2026-04-08 9:25 ` Chen, Yu C 1 sibling, 2 replies; 41+ messages in thread From: Tim Chen @ 2026-04-07 20:35 UTC (permalink / raw) To: Chen, Yu C, K Prateek Nayak, Peter Zijlstra Cc: Pan Deng, mingo, linux-kernel, tianyou.li On Fri, 2026-04-03 at 13:46 +0800, Chen, Yu C wrote: > On 4/2/2026 7:06 PM, K Prateek Nayak wrote: > > Hello Peter, > > > > On 4/2/2026 4:25 PM, Peter Zijlstra wrote: > > > On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote: > > > > > > > It is still not super clear to me how the logic deals with more than > > > > 128CPUs in a DIE domain because that'll need more than the u64 but > > > > sbm_find_next_bit() simply does: > > > > > > > > tmp = leaf->bitmap & mask; /* All are u64 */ > > > > > > > > expecting just the u64 bitmap to represent all the CPUs in the leaf. > > > > > > > > If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask > > > > as 7f (127) which allows a leaf to more than 64 CPUs but we are > > > > using the "u64 bitmap" directly and not: > > > > > > > > find_next_bit(bitmap, arch_sbm_mask) > > > > > > > > Am I missing something here? > > > > > > Nope. That logic just isn't there, that was left as an exercise to the > > > reader :-) > > > > Ack! Let me go fiddle with that. > > > > Nice catch. I hadn't noticed this since we have fewer than > 64 CPUs per die. Please feel free to send patches to me when > they're available. 
> And regarding your other question about the calculation of arch_sbm_shift, > I'm trying to understand why there is a subtraction of 1, should it be: > - arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1; > + arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1]; Perhaps something like arch_sbm_shift = min(sizeof(unsigned long), topology_get_domain_shift(TOPO_TILE_DOMAIN)); to take care of both AMD systems and the 64 bit leaf bitmask limit? Tim > ? > Are we trying to filer the raw global unique die id? - similar to > topo_apicid() > which mask the lower x86_topo_system.dom_shifts[dom - 1]). > > With above change I can get a correct value of leaves (4) rather than (2) in > the original version. > > thanks, > Chenyu > > ^ permalink raw reply [flat|nested] 41+ messages in thread
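[Editor's note: the capping idea can be sketched in userspace as below. Note the cap presumably wants to be expressed in the same unit as the shift, i.e. log2(BITS_PER_LONG) = 6, rather than a byte count; the names and the round-up are assumptions, not kernel code.]

```c
#include <assert.h>

/* Sketch of deriving the sbm geometry from a topology domain shift,
 * capping the CPUs-per-leaf at one unsigned long.  The cap is a
 * shift (log2(64) == 6 on 64-bit); topo_shift and nr_cpus stand in
 * for the x86 topology values. */
#define LEAF_SHIFT_MAX 6	/* log2(BITS_PER_LONG) */

struct sbm_geom {
	unsigned int shift;	/* CPUs per leaf == 1 << shift */
	unsigned int mask;	/* index of a CPU within its leaf */
	unsigned int leafs;	/* number of leaves */
};

static struct sbm_geom sbm_geometry(unsigned int topo_shift,
				    unsigned int nr_cpus)
{
	struct sbm_geom g;

	g.shift = topo_shift < LEAF_SHIFT_MAX ? topo_shift : LEAF_SHIFT_MAX;
	g.mask = (1u << g.shift) - 1;
	/* round up so trailing CPUs still get a leaf */
	g.leafs = (nr_cpus + g.mask) >> g.shift;
	return g;
}
```

With the cap in place a shift(7) domain (128 CPUs) simply gets two 64-CPU leaves instead of one over-wide leaf, sidestepping the single-u64 limit discussed earlier in the thread.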
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-04-07 20:35 ` Tim Chen @ 2026-04-08 3:06 ` K Prateek Nayak 2026-04-08 11:35 ` Chen, Yu C 0 siblings, 1 reply; 41+ messages in thread From: K Prateek Nayak @ 2026-04-08 3:06 UTC (permalink / raw) To: Tim Chen, Chen, Yu C, Peter Zijlstra Cc: Pan Deng, mingo, linux-kernel, tianyou.li Hello Tim, On 4/8/2026 2:05 AM, Tim Chen wrote: >> And regarding your other question about the calculation of arch_sbm_shift, >> I'm trying to understand why there is a subtraction of 1, should it be: >> - arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1; >> + arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1]; > > Perhaps something like > > arch_sbm_shift = min(sizeof(unsigned long), > topology_get_domain_shift(TOPO_TILE_DOMAIN)); > > to take care of both AMD systems and the 64 bit leaf bitmask limit? Ack! But do we want to separate CPUs on the same LLC domain across different cachelines in 64 CPU chunks or should we use the rest of the padding to represent them? I'm collecting some performance numbers to see if it makes any difference under high contention but have you seen benefits of sharding the mask further when there are hundreds of CPUs on the same LLC? -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-04-08 3:06 ` K Prateek Nayak @ 2026-04-08 11:35 ` Chen, Yu C 2026-04-08 15:52 ` K Prateek Nayak 0 siblings, 1 reply; 41+ messages in thread From: Chen, Yu C @ 2026-04-08 11:35 UTC (permalink / raw) To: K Prateek Nayak, Tim Chen, Peter Zijlstra Cc: Pan Deng, mingo, linux-kernel, tianyou.li Hello Prateek, On 4/8/2026 11:06 AM, K Prateek Nayak wrote: > Hello Tim, > > On 4/8/2026 2:05 AM, Tim Chen wrote: >>> And regarding your other question about the calculation of arch_sbm_shift, >>> I'm trying to understand why there is a subtraction of 1, should it be: >>> - arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1; >>> + arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1]; >> >> Perhaps something like >> >> arch_sbm_shift = min(sizeof(unsigned long), >> topology_get_domain_shift(TOPO_TILE_DOMAIN)); >> >> to take care of both AMD systems and the 64 bit leaf bitmask limit? > > Ack! But do we want to separate CPUs on the same LLC domain across > different cachelines in 64 CPU chunks or should we use the rest > of the padding to represent them? > I just saw your email and I had the same question. > I'm collecting some performance numbers to see if it makes any > difference under high contention but have you seen benefits of > sharding the mask further when there are hundreds of CPUs on the > same LLC? > We haven't tried breaking it down further. One possible approach is to partition it at L2 scope, the benefit of which may depend on the workload. thanks, Chenyu ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention 2026-04-08 11:35 ` Chen, Yu C @ 2026-04-08 15:52 ` K Prateek Nayak 2026-04-09 5:17 ` K Prateek Nayak 0 siblings, 1 reply; 41+ messages in thread From: K Prateek Nayak @ 2026-04-08 15:52 UTC (permalink / raw) To: Chen, Yu C, Tim Chen, Peter Zijlstra Cc: Pan Deng, mingo, linux-kernel, tianyou.li Hello Chenyu, On 4/8/2026 5:05 PM, Chen, Yu C wrote: > We haven't tried breaking it down further. One possible approach > is to partition it at L2 scope, the benefit of which may depend on > the workload. I fear at that point we'll have too many cachelines and too much cache pollution when the CPU starts reading this at tick to schedule a newidle balance. A 128 core system would bring in 128 * 64B = 8kB worth of data to traverse the mask and at that point it becomes a trade off between how fast you want reads vs writes and does it even speed up writes after a certain point? Sorry I got distracted by some other stuff today but I'll share the results from my experiments tomorrow. -- Thanks and Regards, Prateek ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-08 15:52                 ` K Prateek Nayak
@ 2026-04-09  5:17                   ` K Prateek Nayak
  2026-04-09 23:09                     ` Tim Chen
  0 siblings, 1 reply; 41+ messages in thread
From: K Prateek Nayak @ 2026-04-09 5:17 UTC (permalink / raw)
  To: Chen, Yu C, Tim Chen, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Chenyu, Tim,

On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 4/8/2026 5:05 PM, Chen, Yu C wrote:
>> We haven't tried breaking it down further. One possible approach
>> is to partition it at L2 scope, the benefit of which may depend on
>> the workload.
>
> I fear at that point we'll have too many cachelines and too much
> cache pollution when the CPU starts reading this at tick to schedule
> a newidle balance.
>
> A 128 core system would bring in 128 * 64B = 8kB worth of data to
> traverse the mask and at that point it becomes a trade off between
> how fast you want reads vs writes and does it even speed up writes
> after a certain point?
>
> Sorry I got distracted by some other stuff today but I'll share the
> results from my experiments tomorrow.

Here is some data from an experiment I ran on a 3rd Generation EPYC
system (2 sockets x 64C/128T, 8 LLCs per socket):

Experiment: two threads pinned per-CPU on all CPUs yield to each other
and operate on some cpumask - one setting the current CPU in the mask
and the other clearing it. This estimates the worst-case scenario where
we have to do one mask modification per sched-switch.

I'm measuring total cycles taken for the cpumask operations with the
following variants:

                                                %cycles vs global mask operation

  global mask                                 : 100.0000% (var: 3.28%)
  per-NUMA mask                               :  32.9209% (var: 7.77%)
  per-LLC mask                                :   1.2977% (var: 4.85%)
  per-LLC mask (u8 operation; no LOCK prefix) :   0.4930% (var: 0.83%)

The per-NUMA split is 3x faster, the per-LLC split on this 16-LLC
machine is 77x faster, and since there is enough space in the cacheline
we can use a u8 to set and clear the CPU atomically without a LOCK
prefix and then do a >> 3 to get the CPU index from the set bit, which
is 202x faster.

If we use the u8 operations, we can only read 8 CPUs per 8-byte load on
a 64-bit system, but with the per-LLC mask we can scan all 16 CPUs on
the LLC with one 8-byte read, and the per-NUMA one requires two 8-byte
reads to scan the 128 CPUs per socket.

I think the per-LLC mask (or, as Tim suggested, 64 CPUs per cacheline)
is a good tradeoff between the speedup and the number of loads required
to piece together the full cpumask. Thoughts?

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 41+ messages in thread
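[Editor's note] The u8-per-CPU variant described above can be sketched in
userspace C as follows. All names here (llc_mask, llc_find_first_cpu, the
16-CPU LLC size) are illustrative assumptions, not code from the actual
patches. The idea: each CPU owns one byte of a cacheline, so a plain byte
store is already atomic on x86 without a LOCK prefix, and dividing a set
bit's position by 8 (">> 3") recovers the CPU index within an 8-byte chunk:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical per-LLC "mask": one byte per CPU inside one 64-byte
 * cacheline, as discussed in the thread. 16 CPUs per LLC assumed. */
#define LLC_CPUS 16

static uint8_t llc_mask[64];            /* one cacheline, byte per CPU */

/* Each CPU touches only its own byte: no LOCK prefix needed on x86. */
static void llc_set_cpu(int cpu)   { llc_mask[cpu] = 1; }
static void llc_clear_cpu(int cpu) { llc_mask[cpu] = 0; }

/* Scan 8 CPUs per 8-byte load; the lowest set byte's bit position
 * shifted right by 3 gives the CPU index within the chunk. */
static int llc_find_first_cpu(void)
{
	for (int base = 0; base < LLC_CPUS; base += 8) {
		uint64_t word;

		memcpy(&word, &llc_mask[base], sizeof(word));
		if (word)
			return base + (__builtin_ctzll(word) >> 3);
	}
	return -1;                      /* no CPU set */
}
```

This is a single-cacheline sketch; the real patch set would need one such
leaf per LLC plus the LOCKed bitmap fallback for the other variants.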
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-09  5:17                   ` K Prateek Nayak
@ 2026-04-09 23:09                     ` Tim Chen
  2026-04-10  5:51                       ` Chen, Yu C
  0 siblings, 1 reply; 41+ messages in thread
From: Tim Chen @ 2026-04-09 23:09 UTC (permalink / raw)
  To: K Prateek Nayak, Chen, Yu C, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

On Thu, 2026-04-09 at 10:47 +0530, K Prateek Nayak wrote:
> Hello Chenyu, Tim,
>
> On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
> > Hello Chenyu,
> >
> > On 4/8/2026 5:05 PM, Chen, Yu C wrote:
> > > We haven't tried breaking it down further. One possible approach
> > > is to partition it at L2 scope, the benefit of which may depend on
> > > the workload.
> >
> > I fear at that point we'll have too many cachelines and too much
> > cache pollution when the CPU starts reading this at tick to schedule
> > a newidle balance.
> >
> > A 128 core system would bring in 128 * 64B = 8kB worth of data to
> > traverse the mask and at that point it becomes a trade off between
> > how fast you want reads vs writes and does it even speed up writes
> > after a certain point?
> >
> > Sorry I got distracted by some other stuff today but I'll share the
> > results from my experiments tomorrow.
>
> Here is some data from an experiment I ran on a 3rd Generation EPYC
> system (2 sockets x 64C/128T, 8 LLCs per socket):
>
> Experiment: two threads pinned per-CPU on all CPUs yield to each other
> and operate on some cpumask - one setting the current CPU in the mask
> and the other clearing it. This estimates the worst-case scenario where
> we have to do one mask modification per sched-switch.
>
> I'm measuring total cycles taken for the cpumask operations with the
> following variants:
>
>                                                 %cycles vs global mask operation
>
>   global mask                                 : 100.0000% (var: 3.28%)
>   per-NUMA mask                               :  32.9209% (var: 7.77%)
>   per-LLC mask                                :   1.2977% (var: 4.85%)
>   per-LLC mask (u8 operation; no LOCK prefix) :   0.4930% (var: 0.83%)
>
> The per-NUMA split is 3x faster, the per-LLC split on this 16-LLC
> machine is 77x faster, and since there is enough space in the cacheline
> we can use a u8 to set and clear the CPU atomically without a LOCK
> prefix and then do a >> 3 to get the CPU index from the set bit, which
> is 202x faster.
>
> If we use the u8 operations, we can only read 8 CPUs per 8-byte load on
> a 64-bit system, but with the per-LLC mask we can scan all 16 CPUs on
> the LLC with one 8-byte read, and the per-NUMA one requires two 8-byte
> reads to scan the 128 CPUs per socket.
>
> I think the per-LLC mask (or, as Tim suggested, 64 CPUs per cacheline)
> is a good tradeoff between the speedup and the number of loads required
> to piece together the full cpumask. Thoughts?

I agree that the per-LLC mask is a good compromise between minimizing
loads and offering good speedups. I think we should get the LLC APIC ID
mask from the 0x4 leaf (L1, L2, L3) instead of inferring it from the
0x1f leaf (Tile, Die, etc.) for Intel. And the cache leaf I think is
the 0x8000_001D leaf for AMD. Those are parsed in the cacheinfo code
and we can get it from there.

Tim

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-09 23:09                     ` Tim Chen
@ 2026-04-10  5:51                       ` Chen, Yu C
  2026-04-10  6:02                         ` K Prateek Nayak
  0 siblings, 1 reply; 41+ messages in thread
From: Chen, Yu C @ 2026-04-10 5:51 UTC (permalink / raw)
  To: Tim Chen, K Prateek Nayak, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hi Prateek, Tim,

On 4/10/2026 7:09 AM, Tim Chen wrote:
> On Thu, 2026-04-09 at 10:47 +0530, K Prateek Nayak wrote:
>> Hello Chenyu, Tim,
>>
>> On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
>>> Hello Chenyu,
>>>
>>> On 4/8/2026 5:05 PM, Chen, Yu C wrote:
>>>> We haven't tried breaking it down further. One possible approach
>>>> is to partition it at L2 scope, the benefit of which may depend on
>>>> the workload.
>>>
>>> I fear at that point we'll have too many cachelines and too much
>>> cache pollution when the CPU starts reading this at tick to schedule
>>> a newidle balance.
>>>
>>> A 128 core system would bring in 128 * 64B = 8kB worth of data to
>>> traverse the mask and at that point it becomes a trade off between
>>> how fast you want reads vs writes and does it even speed up writes
>>> after a certain point?
>>>
>>> Sorry I got distracted by some other stuff today but I'll share the
>>> results from my experiments tomorrow.
>>
>> Here is some data from an experiment I ran on a 3rd Generation EPYC
>> system (2 sockets x 64C/128T, 8 LLCs per socket):
>>
>> Experiment: two threads pinned per-CPU on all CPUs yield to each other
>> and operate on some cpumask - one setting the current CPU in the mask
>> and the other clearing it. This estimates the worst-case scenario where
>> we have to do one mask modification per sched-switch.
>>
>> I'm measuring total cycles taken for the cpumask operations with the
>> following variants:
>>
>>                                                 %cycles vs global mask operation
>>
>>   global mask                                 : 100.0000% (var: 3.28%)
>>   per-NUMA mask                               :  32.9209% (var: 7.77%)
>>   per-LLC mask                                :   1.2977% (var: 4.85%)
>>   per-LLC mask (u8 operation; no LOCK prefix) :   0.4930% (var: 0.83%)
>>
>> The per-NUMA split is 3x faster, the per-LLC split on this 16-LLC
>> machine is 77x faster, and since there is enough space in the cacheline
>> we can use a u8 to set and clear the CPU atomically without a LOCK
>> prefix and then do a >> 3 to get the CPU index from the set bit, which
>> is 202x faster.
>>
>> If we use the u8 operations, we can only read 8 CPUs per 8-byte load on
>> a 64-bit system, but with the per-LLC mask we can scan all 16 CPUs on
>> the LLC with one 8-byte read, and the per-NUMA one requires two 8-byte
>> reads to scan the 128 CPUs per socket.
>>
>> I think the per-LLC mask (or, as Tim suggested, 64 CPUs per cacheline)
>> is a good tradeoff between the speedup and the number of loads required
>> to piece together the full cpumask. Thoughts?

Yes, making it per LLC should work well enough (for balancing) to
achieve optimal benefit. Let me run some similar tests to yours, plus
hackbench/schbench, to see what the results are.

BTW, on AMD systems, does the TILE domain always match the CCX where
L3 is shared? On Intel the DIE is not always mapped to a domain
where L3 is shared.

>
> I agree that the per-LLC mask is a good compromise between minimizing
> loads and offering good speedups. I think we should get the LLC APIC ID
> mask from the 0x4 leaf (L1, L2, L3) instead of inferring it from the
> 0x1f leaf (Tile, Die, etc.) for Intel. And the cache leaf I think is
> the 0x8000_001D leaf for AMD. Those are parsed in the cacheinfo code
> and we can get it from there.
>

Yes, let me check how we can leverage the l3 id for that.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-10  5:51                       ` Chen, Yu C
@ 2026-04-10  6:02                         ` K Prateek Nayak
  0 siblings, 0 replies; 41+ messages in thread
From: K Prateek Nayak @ 2026-04-10 6:02 UTC (permalink / raw)
  To: Chen, Yu C, Tim Chen, Peter Zijlstra
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li

Hello Chenyu, Tim,

On 4/10/2026 11:21 AM, Chen, Yu C wrote:
>>> I think the per-LLC mask (or, as Tim suggested, 64 CPUs per cacheline)
>>> is a good tradeoff between the speedup and the number of loads required
>>> to piece together the full cpumask. Thoughts?
>
> Yes, making it per LLC should work well enough (for balancing) to
> achieve optimal benefit. Let me run some similar tests to yours, plus
> hackbench/schbench, to see what the results are.
>
> BTW, on AMD systems, does the TILE domain always match the CCX where
> L3 is shared? On Intel the DIE is not always mapped to a domain
> where L3 is shared.

On AMD platforms that support the extended leaf 0x80000026, the CCX is
always mapped to L3 and matches the data on the 0x8000001D cache
property leaf for L3.

>
>> I agree that the per-LLC mask is a good compromise between minimizing
>> loads and offering good speedups. I think we should get the LLC APIC ID
>> mask from the 0x4 leaf (L1, L2, L3) instead of inferring it from the
>> 0x1f leaf (Tile, Die, etc.) for Intel. And the cache leaf I think is
>> the 0x8000_001D leaf for AMD. Those are parsed in the cacheinfo code
>> and we can get it from there.
>
> Yes, let me check how we can leverage the l3 id for that.

Ack! I think the cacheinfo is better for all this and is also
compatible with older systems that may not have the extended topology
enumeration leaf. AMD only got it two generations ago, and until then
only the cache property leaf was used for marking the LLC (CCX)
boundary.

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-07 20:35         ` Tim Chen
  2026-04-08  3:06           ` K Prateek Nayak
@ 2026-04-08  9:25           ` Chen, Yu C
  2026-04-08 16:47             ` Tim Chen
  1 sibling, 1 reply; 41+ messages in thread
From: Chen, Yu C @ 2026-04-08 9:25 UTC (permalink / raw)
  To: Tim Chen
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li, K Prateek Nayak,
	Peter Zijlstra

On 4/8/2026 4:35 AM, Tim Chen wrote:
> On Fri, 2026-04-03 at 13:46 +0800, Chen, Yu C wrote:
>> On 4/2/2026 7:06 PM, K Prateek Nayak wrote:
>>> Hello Peter,
>>>
>>> On 4/2/2026 4:25 PM, Peter Zijlstra wrote:
>>>> On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:
>>>>
>>>>> It is still not super clear to me how the logic deals with more than
>>>>> 128CPUs in a DIE domain because that'll need more than the u64 but
>>>>> sbm_find_next_bit() simply does:
>>>>>
>>>>> 	tmp = leaf->bitmap & mask; /* All are u64 */
>>>>>
>>>>> expecting just the u64 bitmap to represent all the CPUs in the leaf.
>>>>>
>>>>> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
>>>>> as 7f (127) which allows a leaf to hold more than 64 CPUs but we are
>>>>> using the "u64 bitmap" directly and not:
>>>>>
>>>>> 	find_next_bit(bitmap, arch_sbm_mask)
>>>>>
>>>>> Am I missing something here?
>>>>
>>>> Nope. That logic just isn't there, that was left as an exercise to the
>>>> reader :-)
>>>
>>> Ack! Let me go fiddle with that.
>>>
>>
>> Nice catch. I hadn't noticed this since we have fewer than
>> 64 CPUs per die. Please feel free to send patches to me when
>> they're available.
>>
>> And regarding your other question about the calculation of arch_sbm_shift,
>> I'm trying to understand why there is a subtraction of 1, should it be:
>> - arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
>> + arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1];
>
> Perhaps something like
>
> 	arch_sbm_shift = min(sizeof(unsigned long),
> 			     topology_get_domain_shift(TOPO_TILE_DOMAIN));
>
> to take care of both AMD system and the 64 bit leaf bitmask limit?
>

Yes, this should be doable (Prateek has mentioned using TOPO_TILE_DOMAIN).
The only drawback I can think of is that if there are more than 64 CPUs
within a die, CPUs in different dies (LLCs) may be indexed into the same
leaf and access the same mask, which would still lead to cache
contention. Maybe we should allocate the leaf cpumask according to the
actual size of a die?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 41+ messages in thread
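[Editor's note] The sharded-bitmap ("leaf") indexing being discussed can be
sketched as follows. Everything here is illustrative: the names (sbm_leaf,
SBM_SHIFT), the 8-leaf array size, and the fixed shift of 6 are assumptions
for demonstration, not the actual patch code. A CPU maps to a
cacheline-sized leaf via `cpu >> shift` and to a bit within that leaf via
`cpu & mask`, which is exactly where the "more than 64 CPUs per leaf"
problem above comes from if the shift exceeds 6:

```c
#include <stdint.h>

#define SBM_SHIFT 6                         /* 2^6 = 64 CPUs per leaf  */
#define SBM_MASK  ((1u << SBM_SHIFT) - 1)

/* One u64 bitmap padded out to a full cacheline to avoid false
 * sharing between leaves. */
struct sbm_leaf {
	uint64_t bitmap;
	uint8_t  pad[56];
} __attribute__((aligned(64)));

static struct sbm_leaf sbm[8];              /* covers 8 * 64 = 512 CPUs */

static unsigned int sbm_leaf_idx(unsigned int cpu) { return cpu >> SBM_SHIFT; }
static unsigned int sbm_bit(unsigned int cpu)      { return cpu & SBM_MASK; }

static void sbm_set_cpu(unsigned int cpu)
{
	sbm[sbm_leaf_idx(cpu)].bitmap |= 1ull << sbm_bit(cpu);
}

static int sbm_test_cpu(unsigned int cpu)
{
	return (sbm[sbm_leaf_idx(cpu)].bitmap >> sbm_bit(cpu)) & 1;
}
```

With SBM_SHIFT capped at 6 a leaf never holds more than 64 CPUs, so the
single u64 per leaf suffices; a larger shift is what would require the
`find_next_bit(bitmap, arch_sbm_mask)` form Prateek points out.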
* Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
  2026-04-08  9:25           ` Chen, Yu C
@ 2026-04-08 16:47             ` Tim Chen
  0 siblings, 0 replies; 41+ messages in thread
From: Tim Chen @ 2026-04-08 16:47 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Pan Deng, mingo, linux-kernel, tianyou.li, K Prateek Nayak,
	Peter Zijlstra

On Wed, 2026-04-08 at 17:25 +0800, Chen, Yu C wrote:
> On 4/8/2026 4:35 AM, Tim Chen wrote:
> > On Fri, 2026-04-03 at 13:46 +0800, Chen, Yu C wrote:
> > > On 4/2/2026 7:06 PM, K Prateek Nayak wrote:
> > > > Hello Peter,
> > > >
> > > > On 4/2/2026 4:25 PM, Peter Zijlstra wrote:
> > > > > On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:
> > > > >
> > > > > > It is still not super clear to me how the logic deals with more than
> > > > > > 128CPUs in a DIE domain because that'll need more than the u64 but
> > > > > > sbm_find_next_bit() simply does:
> > > > > >
> > > > > > 	tmp = leaf->bitmap & mask; /* All are u64 */
> > > > > >
> > > > > > expecting just the u64 bitmap to represent all the CPUs in the leaf.
> > > > > >
> > > > > > If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
> > > > > > as 7f (127) which allows a leaf to hold more than 64 CPUs but we are
> > > > > > using the "u64 bitmap" directly and not:
> > > > > >
> > > > > > 	find_next_bit(bitmap, arch_sbm_mask)
> > > > > >
> > > > > > Am I missing something here?
> > > > >
> > > > > Nope. That logic just isn't there, that was left as an exercise to the
> > > > > reader :-)
> > > >
> > > > Ack! Let me go fiddle with that.
> > > >
> > >
> > > Nice catch. I hadn't noticed this since we have fewer than
> > > 64 CPUs per die. Please feel free to send patches to me when
> > > they're available.
> > >
> > > And regarding your other question about the calculation of arch_sbm_shift,
> > > I'm trying to understand why there is a subtraction of 1, should it be:
> > > - arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
> > > + arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN - 1];
> >
> > Perhaps something like
> >
> > 	arch_sbm_shift = min(sizeof(unsigned long),
> > 			     topology_get_domain_shift(TOPO_TILE_DOMAIN));
> >
> > to take care of both AMD system and the 64 bit leaf bitmask limit?
> >
>
> Yes, this should be doable (Prateek has mentioned using TOPO_TILE_DOMAIN).
> The only drawback I can think of is that if there are more than 64 CPUs
> within a die, CPUs in different dies (LLCs) may be indexed into the same
> leaf and access the same mask,

First, I think I should have used

	arch_sbm_shift = min(BITS_PER_LONG,
			     topology_get_domain_shift(TOPO_TILE_DOMAIN));

I am assuming that we should choose TOPO_DIE_DOMAIN for Intel CPUs and
TOPO_TILE_DOMAIN for AMD CPUs. And the assumption is that such a domain
choice will span one L3 (I think that's the case). Then leaf domains
smaller than the domain size will also only span one L3 by definition.

So for the 128 CPUs example you gave, both leaves with CPU 0-63 and
64-127 will span the same LLC and we should not have cache bounce.

Tim

> which would still lead to cache
> contention. Maybe we should allocate the leaf cpumask according to the
> actual size of a die?
>
> thanks,
> Chenyu

^ permalink raw reply	[flat|nested] 41+ messages in thread
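[Editor's note] One detail worth flagging in the `min()` expressions being
traded above: if the `dom_shifts[]` values are log2 CPU-count shifts (as in
x86 APIC-ID topology), then the "64 bit leaf bitmask limit" corresponds to
capping the shift at ilog2(BITS_PER_LONG) = 6, not at
`sizeof(unsigned long)` (8) or `BITS_PER_LONG` (64) themselves. A hedged
sketch of that cap (the name sbm_cap_shift is illustrative, not from the
patches):

```c
#define BITS_PER_LONG 64

/* Cap a log2 domain shift so that one leaf's bitmap fits in a single
 * unsigned long: 2^6 = 64 CPUs per leaf at most. */
static inline unsigned int sbm_cap_shift(unsigned int dom_shift)
{
	/* ilog2(64) = 6; __builtin_ctz works since BITS_PER_LONG is a
	 * power of two. */
	unsigned int max_shift = __builtin_ctz(BITS_PER_LONG);

	return dom_shift < max_shift ? dom_shift : max_shift;
}
```

Under this assumption, a 256-CPU die (shift 7) would be capped to shift 6,
giving several 64-CPU leaves per die rather than one over-wide leaf.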
* Re: [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention
  2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
                   ` (3 preceding siblings ...)
  2025-07-21  6:10 ` [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
@ 2026-03-20  9:59 ` Peter Zijlstra
  2026-03-20 12:50   ` Peter Zijlstra
  4 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20 9:59 UTC (permalink / raw)
  To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Mon, Jul 21, 2025 at 02:10:22PM +0800, Pan Deng wrote:
> When running multi-instance FFmpeg workload in cloud environment,
> cache line contention is severe during the access to root_domain data
> structures, which significantly degrades performance.
>
> The SUT is a 2-socket machine with 240 physical cores and 480 logical

What's a SUT?

> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS(frame per second) is used as score.

So I think we can do some of this, but that workload is hilariously
poorly configured.

You're pinning things but not partitioning, why? If you would have
created 60 partitions, one for each FFmpeg thingy, then you wouldn't
have needed any of this.

You're running at FIFO99 (IOW prio-0) and then claiming prio-0 is used
more heavily than others... well d0h. What priority assignment scheme
led to this? Is there a sensible reason these must be 99?

^ permalink raw reply	[flat|nested] 41+ messages in thread
* Re: [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention
  2026-03-20  9:59 ` [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Peter Zijlstra
@ 2026-03-20 12:50   ` Peter Zijlstra
  0 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2026-03-20 12:50 UTC (permalink / raw)
  To: Pan Deng; +Cc: mingo, linux-kernel, tianyou.li, tim.c.chen, yu.c.chen

On Fri, Mar 20, 2026 at 10:59:55AM +0100, Peter Zijlstra wrote:
> On Mon, Jul 21, 2025 at 02:10:22PM +0800, Pan Deng wrote:
> > When running multi-instance FFmpeg workload in cloud environment,
> > cache line contention is severe during the access to root_domain data
> > structures, which significantly degrades performance.
> >
> > The SUT is a 2-socket machine with 240 physical cores and 480 logical
>
> What's a SUT?
>
> > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > with FIFO scheduling. FPS(frame per second) is used as score.
>
> So I think we can do some of this, but that workload is hilariously
> poorly configured.
>
> You're pinning things but not partitioning, why? If you would have
> created 60 partitions, one for each FFmpeg thingy, then you wouldn't
> have needed any of this.
>
> You're running at FIFO99 (IOW prio-0) and then claiming prio-0 is used
> more heavily than others... well d0h. What priority assignment scheme
> led to this? Is there a sensible reason these must be 99?

Also, you failed the most basic of tasks: Cc all the relevant people.

I would've hoped at least some of the 'reviewers' you had would've told
you about that. Notably, Steve is the one that often looks after this
RT stuff.

^ permalink raw reply	[flat|nested] 41+ messages in thread
end of thread, other threads:[~2026-04-10  6:02 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2025-07-21  6:10 [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Pan Deng
2025-07-21  6:10 ` [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate " Pan Deng
2026-03-20 10:09   ` Peter Zijlstra
2026-03-24  9:36     ` Deng, Pan
2026-03-24 12:11       ` Peter Zijlstra
2026-03-27 10:17         ` Deng, Pan
2026-04-02 10:37         ` Deng, Pan
2026-04-02 10:43           ` Peter Zijlstra
2026-04-08 10:16             ` Chen, Yu C
2026-04-09 11:47               ` Deng, Pan
2025-07-21  6:10 ` [PATCH v2 2/4] sched/rt: Restructure root_domain to reduce cacheline contention Pan Deng
2026-03-20 10:18   ` Peter Zijlstra
2025-07-21  6:10 ` [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Pan Deng
2026-03-20 10:24   ` Peter Zijlstra
2026-03-23 18:09     ` Tim Chen
2026-03-24 12:16       ` Peter Zijlstra
2026-03-24 22:40         ` Tim Chen
2025-07-21  6:10 ` [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention Pan Deng
2026-03-20 12:40   ` Peter Zijlstra
2026-03-23 18:45     ` Tim Chen
2026-03-24 12:00       ` Peter Zijlstra
2026-03-31  5:37         ` Chen, Yu C
2026-03-31 10:19           ` K Prateek Nayak
2026-04-02  3:15             ` Chen, Yu C
2026-04-02  4:41               ` K Prateek Nayak
2026-04-02 10:55                 ` Peter Zijlstra
2026-04-02 11:06                   ` K Prateek Nayak
2026-04-03  5:46                     ` Chen, Yu C
2026-04-03  8:13                       ` K Prateek Nayak
2026-04-07 20:35                         ` Tim Chen
2026-04-08  3:06                           ` K Prateek Nayak
2026-04-08 11:35                             ` Chen, Yu C
2026-04-08 15:52                               ` K Prateek Nayak
2026-04-09  5:17                                 ` K Prateek Nayak
2026-04-09 23:09                                   ` Tim Chen
2026-04-10  5:51                                     ` Chen, Yu C
2026-04-10  6:02                                       ` K Prateek Nayak
2026-04-08  9:25                           ` Chen, Yu C
2026-04-08 16:47                             ` Tim Chen
2026-03-20  9:59 ` [PATCH v2 0/4] sched/rt: mitigate root_domain cache line contention Peter Zijlstra
2026-03-20 12:50   ` Peter Zijlstra