From: Pan Deng <pan.deng@intel.com>
To: peterz@infradead.org, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org, tianyou.li@intel.com,
	tim.c.chen@linux.intel.com, yu.c.chen@intel.com,
	pan.deng@intel.com
Subject: [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
Date: Mon,  7 Jul 2025 10:35:25 +0800	[thread overview]
Message-ID: <c3fa01bed2f875293ac65425c75a322e8e70e1d3.1751852370.git.pan.deng@intel.com> (raw)
In-Reply-To: <cover.1751852370.git.pan.deng@intel.com>

When running a multi-instance FFmpeg workload on a high-core-count (HCC)
system, significant cache line contention is observed around
`cpupri_vec->count` and `mask` in struct root_domain.

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding. The transcoding sub-threads run with RT
priority 99 under SCHED_FIFO scheduling, and FPS is used as the score.
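
For reference, each transcoding sub-thread is switched to RT roughly like
this (a minimal sketch of the setup, not the actual FFmpeg code; error
handling omitted):

        #include <pthread.h>
        #include <sched.h>

        static void set_rt_fifo99(pthread_t tid)
        {
                /* SCHED_FIFO at RT priority 99, as used by the workload's sub-threads */
                struct sched_param sp = { .sched_priority = 99 };

                pthread_setschedparam(tid, SCHED_FIFO, &sp);
        }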

perf c2c tool reveals:
root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
   and contends with the other fields, since counts[0] is updated more
   frequently than the other elements: it changes whenever an RT task is
   enqueued to an empty runqueue or dequeued from a non-overloaded one.
- cycles per load: ~10K to 59K

cpupri's last cache line:
- `cpupri_vec->count` and `mask` contend with each other (see the sketch
  below). The transcoding threads use RT priority 99, so the contention
  shows up at the end of the pri_to_cpu[] array, i.e. in cpupri's last
  cache line.
- cycles per load: ~1.5K to 10.5K
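
The two fields are written back to back on every CPU priority transition,
while cpupri_find_fitness() reads the same vectors. A simplified paraphrase
of the update path in kernel/sched/cpupri.c (not the verbatim kernel code)
illustrates why the shared cache line ping-pongs:

        /* cpupri_set(cp, cpu, newpri), simplified; indices already converted */
        if (newpri != CPUPRI_INVALID) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];

                cpumask_set_cpu(cpu, vec->mask);        /* store to mask  */
                smp_mb__before_atomic();
                atomic_inc(&vec->count);                /* store to count */
        }
        if (oldpri != CPUPRI_INVALID) {
                struct cpupri_vec *vec = &cp->pri_to_cpu[oldpri];

                atomic_dec(&vec->count);                /* store to count */
                smp_mb__after_atomic();
                cpumask_clear_cpu(cpu, vec->mask);      /* store to mask  */
        }

Since count and mask share a cache line, each of these stores also
invalidates the line under the readers in cpupri_find_fitness(), which is
what the report above shows.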

This change mitigates the `cpupri_vec->count` and `mask` related contention
by placing each count and mask on separate cache lines.

As a result:
- FPS improves by ~11%
- Kernel cycles% drops from ~20% to ~11%
- `count`/`mask` related cache line contention is mitigated: perf c2c
  shows root_domain cache line 3's `cycles per load` drops from ~10K-59K
  to ~0.5K-8K, and cpupri's last cache line no longer appears in the report.

Note: the side effect of this change is that struct cpupri grows from
26 cache lines to 203 cache lines.
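
The 203 figure follows from the new per-vector layout (a rough pahole-style
view, assuming 64-byte cache lines):

        struct cpupri_vec {
                atomic_t        count;          /*    0     4 */
                /* 60 bytes of padding */
                cpumask_var_t   mask;           /*   64     8 */  /* ____cacheline_aligned */
                /* 56 bytes of tail padding, struct alignment is now 64 */
                /* size: 128, cachelines: 2 */
        };

        101 * 128 bytes + 8 bytes (cpu_to_pri), rounded up to 64-byte
        alignment, gives 12992 bytes = 203 cache lines.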

An alternative approach could be to separate `counts` and `masks` into two
vectors (counts[] and masks[]) instead of one cpupri_vec array, and to add
two paddings (a rough sketch follows the list):
1. Between counts[0] and counts[1], since counts[0] is updated more
   frequently than the others.
2. Between the two vectors, since counts[] is read-write accessed while
   masks[] is read-mostly (it only stores pointers).
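
A rough sketch of that alternative layout (field names are hypothetical and
the exact padding is left to ____cacheline_aligned; the actual alternative
patch is not posted here):

        struct cpupri {
                /* counts[0] is the hottest element, keep it on its own line */
                atomic_t        count0;
                atomic_t        counts[CPUPRI_NR_PRIORITIES - 1] ____cacheline_aligned;
                /* second padding: read-write counts[] vs. read-mostly masks[] */
                cpumask_var_t   masks[CPUPRI_NR_PRIORITIES] ____cacheline_aligned;
                int            *cpu_to_pri;
        };

This keeps the hot counter and the read-mostly mask pointers on separate
cache lines while the structure stays around 21 cache lines, at the cost of
reworking every pri_to_cpu[] user (the 31+/21- LoC mentioned below).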

The alternative approach introduces 31+/21- LoC of changes; it achieves
almost the same performance, and at the same time struct cpupri shrinks
from 26 cache lines to 21 cache lines.

Appendix:
1. Current layout of contended data structures:
struct root_domain {
    atomic_t                   refcount;             /*     0     4 */
    atomic_t                   rto_count;            /*     4     4 */
    struct callback_head rcu __attribute__((__aligned__(8)));/*8 16 */
    cpumask_var_t              span;                 /*    24     8 */
    cpumask_var_t              online;               /*    32     8 */
    bool                       overloaded;           /*    40     1 */
    bool                       overutilized;         /*    41     1 */
    /* XXX 6 bytes hole, try to pack */
    cpumask_var_t              dlo_mask;             /*    48     8 */
    atomic_t                   dlo_count;            /*    56     4 */
    /* XXX 4 bytes hole, try to pack */
    /* --- cacheline 1 boundary (64 bytes) --- */
    struct dl_bw               dl_bw;                /*    64    24 */
    struct cpudl               cpudl;                /*    88    24 */
    u64                        visit_gen;            /*   112     8 */
    struct irq_work            rto_push_work;        /*   120    32 */
    /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
    raw_spinlock_t             rto_lock;             /*   152     4 */
    int                        rto_loop;             /*   156     4 */
    int                        rto_cpu;              /*   160     4 */
    atomic_t                   rto_loop_next;        /*   164     4 */
    atomic_t                   rto_loop_start;       /*   168     4 */
    /* XXX 4 bytes hole, try to pack */
    cpumask_var_t              rto_mask;             /*   176     8 */
    struct cpupri              cpupri;               /*   184  1624 */
    /* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
    struct perf_domain *       pd;                   /*  1808     8 */
    /* size: 1816, cachelines: 29, members: 21 */
    /* sum members: 1802, holes: 3, sum holes: 14 */
    /* forced alignments: 1 */
    /* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

struct cpupri {
        struct cpupri_vec          pri_to_cpu[101];      /*     0  1616 */
        /* --- cacheline 25 boundary (1600 bytes) was 16 bytes ago --- */
        int *                      cpu_to_pri;           /*  1616     8 */

        /* size: 1624, cachelines: 26, members: 2 */
        /* last cacheline: 24 bytes */
};

struct cpupri_vec {
        atomic_t                   count;                /*     0     4 */

        /* XXX 4 bytes hole, try to pack */

        cpumask_var_t              mask;                 /*     8     8 */

        /* size: 16, cachelines: 1, members: 2 */
        /* sum members: 12, holes: 1, sum holes: 4 */
        /* last cacheline: 16 bytes */
};

2. Perf c2c report of root_domain cache line 3:
-------  -------  ------  ------  ------  ------  ------------------------
 Rmt      Lcl     Store   Data    Load    Total    Symbol
Hitm%    Hitm%   L1 Hit%  offset  cycles  records
-------  -------  ------  ------  ------  ------  ------------------------
 353       44       62    0xff14d42c400e3880
-------  -------  ------  ------  ------  ------  ------------------------
 0.00%    2.27%    0.00%  0x0     21683   6     __flush_smp_call_function_
 0.00%    2.27%    0.00%  0x0     22294   5     __flush_smp_call_function_
 0.28%    0.00%    0.00%  0x0     0       2     irq_work_queue_on
 0.28%    0.00%    0.00%  0x0     27824   4     irq_work_single
 0.00%    0.00%    1.61%  0x0     28151   6     irq_work_queue_on
 0.57%    0.00%    0.00%  0x18    21822   8     native_queued_spin_lock_sl
 0.28%    2.27%    0.00%  0x18    16101   10    native_queued_spin_lock_sl
 0.57%    0.00%    0.00%  0x18    33199   5     native_queued_spin_lock_sl
 0.00%    0.00%    1.61%  0x18    10908   32    _raw_spin_lock
 0.00%    0.00%    1.61%  0x18    59770   2     _raw_spin_lock
 0.00%    0.00%    1.61%  0x18    0       1     _raw_spin_unlock
 1.42%    0.00%    0.00%  0x20    12918   20    pull_rt_task
 0.85%    0.00%   25.81%  0x24    31123   199   pull_rt_task
 0.85%    0.00%    3.23%  0x24    38218   24    pull_rt_task
 0.57%    4.55%   19.35%  0x28    30558   207   pull_rt_task
 0.28%    0.00%    0.00%  0x28    55504   10    pull_rt_task
18.70%   18.18%    0.00%  0x30    26438   291   dequeue_pushable_task
17.28%   22.73%    0.00%  0x30    29347   281   enqueue_pushable_task
 1.70%    2.27%    0.00%  0x30    12819   31    enqueue_pushable_task
 0.28%    0.00%    0.00%  0x30    17726   18    dequeue_pushable_task
34.56%   29.55%    0.00%  0x38    25509   527   cpupri_find_fitness
13.88%   11.36%   24.19%  0x38    30654   342   cpupri_set
 3.12%    2.27%    0.00%  0x38    18093   39    cpupri_set
 1.70%    0.00%    0.00%  0x38    37661   52    cpupri_find_fitness
 1.42%    2.27%   19.35%  0x38    31110   211   cpupri_set
 1.42%    0.00%    1.61%  0x38    45035   31    cpupri_set

3. Perf c2c report of cpupri's last cache line
-------  -------  ------  ------  ------  ------  ------------------------
 Rmt      Lcl     Store   Data    Load    Total    Symbol
Hitm%    Hitm%   L1 Hit%  offset  cycles  records
-------  -------  ------  ------  ------  ------  ------------------------
 149       43       41    0xff14d42c400e3ec0
-------  -------  ------  ------  ------  ------  ------------------------
 8.72%   11.63%    0.00%  0x8     2001    165   cpupri_find_fitness
 1.34%    2.33%    0.00%  0x18    1456    151   cpupri_find_fitness
 8.72%    9.30%   58.54%  0x28    1744    263   cpupri_set
 2.01%    4.65%   41.46%  0x28    1958    301   cpupri_set
 1.34%    0.00%    0.00%  0x28    10580   6     cpupri_set
69.80%   67.44%    0.00%  0x30    1754    347   cpupri_set
 8.05%    4.65%    0.00%  0x30    2144    256   cpupri_set

Signed-off-by: Pan Deng <pan.deng@intel.com>
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/cpupri.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
 
 struct cpupri_vec {
 	atomic_t		count;
-	cpumask_var_t		mask;
+	cpumask_var_t		mask	____cacheline_aligned;
 };
 
 struct cpupri {
-- 
2.43.5

