From: Pan Deng <pan.deng@intel.com>
To: peterz@infradead.org, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org, tianyou.li@intel.com,
tim.c.chen@linux.intel.com, yu.c.chen@intel.com,
pan.deng@intel.com
Subject: [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
Date: Mon, 7 Jul 2025 10:35:25 +0800
Message-ID: <c3fa01bed2f875293ac65425c75a322e8e70e1d3.1751852370.git.pan.deng@intel.com>
In-Reply-To: <cover.1751852370.git.pan.deng@intel.com>
When running a multi-instance FFmpeg workload on an HCC system, significant
cache line contention is observed around `cpupri_vec->count` and `mask` in
struct root_domain.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
The perf c2c tool reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
  and contends with other fields, since counts[0] is updated more
  frequently than the others: it changes whenever an RT task is enqueued
  on an empty runqueue or dequeued from a non-overloaded runqueue.
- cycles per load: ~10K to 59K
cpupri's last cache line:
- `cpupri_vec->count` and `mask` contend with each other. The transcoding
  threads use RT priority 99, so the contention lands on the last entries
  of pri_to_cpu[], i.e. cpupri's last cache line.
- cycles per load: ~1.5K to 10.5K
This change mitigates the `cpupri_vec->count` and `mask` related
contention by separating each count and mask into different cache lines.
As a result:
- FPS improves by ~11%
- Kernel cycles% drops from ~20% to ~11%
- `count` and `mask` related cache line contention is mitigated: perf c2c
  shows that root_domain cache line 3 `cycles per load` drops from
  ~10K-59K to ~0.5K-8K, and cpupri's last cache line no longer appears
  in the report.
Note: The side effect of this change is that struct cpupri grows from
26 cache lines to 203, since each of the 101 cpupri_vec entries now
spans two cache lines (see the derived layout at the end of appendix
item 1).
An alternative approach could be to separate `counts` and `masks` into
two vectors (counts[] and masks[]) and add two paddings, as sketched
below:
1. Between counts[0] and counts[1], since counts[0] is updated more
frequently than the others.
2. Between the two vectors, since counts[] is read-write while masks[]
is mostly read (it holds pointers).
The alternative introduces the complexity of a 31+/21- LoC change; it
achieves almost the same performance, and struct cpupri shrinks from
26 cache lines to 21.
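For reference, a rough sketch of what that alternative layout could look
like (illustrative only; the field names and the use of
____cacheline_aligned as padding are assumptions, not the code that was
measured):

struct cpupri {
	/*
	 * Former counts[0]: the most frequently updated counter, per
	 * the analysis above, so keep it on its own cache line.
	 */
	atomic_t	count_prio0;
	/* padding 1: keep the remaining counters off counts[0]'s line */
	atomic_t	counts[CPUPRI_NR_PRIORITIES - 1] ____cacheline_aligned;
	/* padding 2: masks[] is mostly read, keep it off the RW counts[] lines */
	cpumask_var_t	masks[CPUPRI_NR_PRIORITIES] ____cacheline_aligned;
	int		*cpu_to_pri;
};

With CPUPRI_NR_PRIORITIES == 101, this comes to roughly 21 cache lines,
consistent with the figure above.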
Appendix:
1. Current layout of contended data structures:
struct root_domain {
atomic_t refcount; /* 0 4 */
atomic_t rto_count; /* 4 4 */
struct callback_head rcu __attribute__((__aligned__(8))); /* 8 16 */
cpumask_var_t span; /* 24 8 */
cpumask_var_t online; /* 32 8 */
bool overloaded; /* 40 1 */
bool overutilized; /* 41 1 */
/* XXX 6 bytes hole, try to pack */
cpumask_var_t dlo_mask; /* 48 8 */
atomic_t dlo_count; /* 56 4 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 1 boundary (64 bytes) --- */
struct dl_bw dl_bw; /* 64 24 */
struct cpudl cpudl; /* 88 24 */
u64 visit_gen; /* 112 8 */
struct irq_work rto_push_work; /* 120 32 */
/* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
raw_spinlock_t rto_lock; /* 152 4 */
int rto_loop; /* 156 4 */
int rto_cpu; /* 160 4 */
atomic_t rto_loop_next; /* 164 4 */
atomic_t rto_loop_start; /* 168 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t rto_mask; /* 176 8 */
struct cpupri cpupri; /* 184 1624 */
/* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
struct perf_domain * pd; /* 1808 8 */
/* size: 1816, cachelines: 29, members: 21 */
/* sum members: 1802, holes: 3, sum holes: 14 */
/* forced alignments: 1 */
/* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));
struct cpupri {
struct cpupri_vec pri_to_cpu[101]; /* 0 1616 */
/* --- cacheline 25 boundary (1600 bytes) was 16 bytes ago --- */
int * cpu_to_pri; /* 1616 8 */
/* size: 1624, cachelines: 26, members: 2 */
/* last cacheline: 24 bytes */
};
struct cpupri_vec {
atomic_t count; /* 0 4 */
/* XXX 4 bytes hole, try to pack */
cpumask_var_t mask; /* 8 8 */
/* size: 16, cachelines: 1, members: 2 */
/* sum members: 12, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};
2. Perf c2c report of root_domain cache line 3:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
353 44 62 0xff14d42c400e3880
------- ------- ------ ------ ------ ------ ------------------------
0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_
0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_
0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on
0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single
0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on
0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl
0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl
0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl
0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock
0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock
1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task
0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task
0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task
0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task
0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task
18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task
17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task
1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task
0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task
34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness
13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set
3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set
1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness
1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set
1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set
3. Perf c2c report of cpupri's last cache line:
------- ------- ------ ------ ------ ------ ------------------------
Rmt Lcl Store Data Load Total Symbol
Hitm% Hitm% L1 Hit% offset cycles records
------- ------- ------ ------ ------ ------ ------------------------
149 43 41 0xff14d42c400e3ec0
------- ------- ------ ------ ------ ------ ------------------------
8.72% 11.63% 0.00% 0x8 2001 165 cpupri_find_fitness
1.34% 2.33% 0.00% 0x18 1456 151 cpupri_find_fitness
8.72% 9.30% 58.54% 0x28 1744 263 cpupri_set
2.01% 4.65% 41.46% 0x28 1958 301 cpupri_set
1.34% 0.00% 0.00% 0x28 10580 6 cpupri_set
69.80% 67.44% 0.00% 0x30 1754 347 cpupri_set
8.05% 4.65% 0.00% 0x30 2144 256 cpupri_set
Signed-off-by: Pan Deng <pan.deng@intel.com>
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/cpupri.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
 struct cpupri_vec {
 	atomic_t		count;
-	cpumask_var_t		mask;
+	cpumask_var_t		mask ____cacheline_aligned;
 };

 struct cpupri {
--
2.43.5