From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>, Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
Hillf Danton <hdanton@sina.com>,
Shrikanth Hegde <sshegde@linux.ibm.com>,
Jianyong Wu <jianyong.wu@outlook.com>,
Yangyu Chen <cyy@cyyself.name>,
Tingyin Duan <tingyin.duan@gmail.com>,
Vern Hao <vernhao@tencent.com>, Vern Hao <haoxing990@gmail.com>,
Len Brown <len.brown@intel.com>,
Tim Chen <tim.c.chen@linux.intel.com>,
Aubrey Li <aubrey.li@intel.com>, Zhao Liu <zhao1.liu@intel.com>,
Chen Yu <yu.chen.surf@gmail.com>,
Adam Li <adamli@os.amperecomputing.com>,
Aaron Lu <ziqianlu@bytedance.com>,
Tim Chen <tim.c.chen@intel.com>, Josh Don <joshdon@google.com>,
Gavin Guo <gavinguo@igalia.com>,
Qais Yousef <qyousef@layalina.io>,
Libo Chen <libchen@purestorage.com>,
linux-kernel@vger.kernel.org
Subject: [Patch v4 18/22] sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
Date: Wed, 1 Apr 2026 14:52:30 -0700
Message-ID: <71972e12ab4f08aff422b31e34df09bdbd94de84.1775065312.git.tim.c.chen@linux.intel.com>
In-Reply-To: <cover.1775065312.git.tim.c.chen@linux.intel.com>
From: Chen Yu <yu.c.chen@intel.com>
Cache-aware load balancing should only be enabled when there is more
than one LLC within a NUMA node. Introduce the sched_cache_present
static key to indicate whether the platform has such a topology, and
use it to gate cache aware scheduling.
Test results:
The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.
The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs, and each CCX has 16 CPUs.
hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.
[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
distinct active threads is below the capacity of an LLC.
schbench shows limited wakeup latency improvement.
ChaCha20-xiangshan (a RISC-V simulator workload) shows good
throughput improvement. No obvious difference was observed in
netperf/stream/stress-ng in Hmean.
Genoa:
Significant improvement is observed in hackbench when
the number of active threads is lower than the number
of CPUs within one LLC. On v2, Aaron reported improvement
in hackbench/redis when the system is underloaded.
ChaCha20-xiangshan shows a large throughput improvement.
Phoronix tested v1 and saw good improvements in 30+
cases[3]. No obvious difference was observed in
netperf/stream/stress-ng in Hmean.
Details:
Due to length constraints, data showing little difference from
the baseline is not presented.
Sapphire Rapids:
[hackbench pipe]
================
case load baseline(std%) compare%( std%)
threads-pipe-10 1-groups 1.00 ( 1.22) +26.09 ( 1.10)
threads-pipe-10 2-groups 1.00 ( 4.90) +22.88 ( 0.18)
threads-pipe-10 4-groups 1.00 ( 2.07) +9.00 ( 3.49)
threads-pipe-10 8-groups 1.00 ( 8.13) +3.45 ( 3.62)
threads-pipe-16 1-groups 1.00 ( 2.11) +26.30 ( 0.08)
threads-pipe-16 2-groups 1.00 ( 15.13) -1.77 ( 11.89)
threads-pipe-16 4-groups 1.00 ( 4.37) +0.58 ( 7.99)
threads-pipe-16 8-groups 1.00 ( 2.88) +2.71 ( 3.50)
threads-pipe-2 1-groups 1.00 ( 9.40) +22.07 ( 0.71)
threads-pipe-2 2-groups 1.00 ( 9.99) +18.01 ( 0.95)
threads-pipe-2 4-groups 1.00 ( 3.98) +24.66 ( 0.96)
threads-pipe-2 8-groups 1.00 ( 7.00) +21.83 ( 0.23)
threads-pipe-20 1-groups 1.00 ( 1.03) +28.84 ( 0.21)
threads-pipe-20 2-groups 1.00 ( 4.42) +31.90 ( 3.15)
threads-pipe-20 4-groups 1.00 ( 9.97) +4.56 ( 1.69)
threads-pipe-20 8-groups 1.00 ( 1.87) +1.25 ( 0.74)
threads-pipe-4 1-groups 1.00 ( 4.48) +25.67 ( 0.78)
threads-pipe-4 2-groups 1.00 ( 9.14) +4.91 ( 2.08)
threads-pipe-4 4-groups 1.00 ( 7.68) +19.36 ( 1.53)
threads-pipe-4 8-groups 1.00 ( 10.79) +7.20 ( 12.20)
threads-pipe-8 1-groups 1.00 ( 4.69) +21.93 ( 0.03)
threads-pipe-8 2-groups 1.00 ( 1.16) +25.29 ( 0.65)
threads-pipe-8 4-groups 1.00 ( 2.23) -1.27 ( 3.62)
threads-pipe-8 8-groups 1.00 ( 4.65) -3.08 ( 2.75)
Note: The default number of fds in hackbench is changed from 20 to various
values to ensure that threads fit within a single LLC, especially on AMD
systems. Take "threads-pipe-8, 2-groups" for example: the number of fds
is 8, and 2 groups are created.
[schbench]
The 99th percentile wakeup latency shows some improvement when the
system is underloaded, while it makes little difference as system
utilization increases.
99th Wakeup Latencies Base (mean std) Compare (mean std) Change
=========================================================================
thread=2 9.00(0.00) 9.00(1.73) 0.00%
thread=4 7.33(0.58) 6.33(0.58) +13.64%
thread=8 9.00(0.00) 7.67(1.15) +14.78%
thread=16 8.67(0.58) 8.67(1.53) 0.00%
thread=32 9.00(0.00) 7.00(0.00) +22.22%
thread=64 9.33(0.58) 9.67(0.58) -3.64%
thread=128 12.00(0.00) 12.00(0.00) 0.00%
[chacha20 on simulated risc-v]
baseline:
Host time spent: 67861ms
cache aware scheduling enabled:
Host time spent: 54441ms
Time reduced by ~20% (a 24% speedup)
Genoa:
[hackbench pipe]
The default number of fds is 20, which exceeds the number of CPUs
in an LLC, so it is adjusted to 2, 4, 6, 8 and 20 respectively.
Excluding results with large run-to-run variance, a 10% ~ 50%
improvement is observed when the system is underloaded:
[hackbench pipe]
================
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 2.89) +47.33 ( 1.20)
threads-pipe-2 2-groups 1.00 ( 3.88) +39.82 ( 0.61)
threads-pipe-2 4-groups 1.00 ( 8.76) +5.57 ( 13.10)
threads-pipe-20 1-groups 1.00 ( 4.61) +11.72 ( 1.06)
threads-pipe-20 2-groups 1.00 ( 6.18) +14.55 ( 1.47)
threads-pipe-20 4-groups 1.00 ( 2.99) +10.16 ( 4.49)
threads-pipe-4 1-groups 1.00 ( 4.23) +43.70 ( 2.14)
threads-pipe-4 2-groups 1.00 ( 3.68) +8.45 ( 4.04)
threads-pipe-4 4-groups 1.00 ( 17.72) +2.42 ( 1.14)
threads-pipe-6 1-groups 1.00 ( 3.10) +7.74 ( 3.83)
threads-pipe-6 2-groups 1.00 ( 3.42) +14.26 ( 4.53)
threads-pipe-6 4-groups 1.00 ( 10.34) +10.94 ( 7.12)
threads-pipe-8 1-groups 1.00 ( 4.21) +9.06 ( 4.43)
threads-pipe-8 2-groups 1.00 ( 1.88) +3.74 ( 0.58)
threads-pipe-8 4-groups 1.00 ( 2.78) +23.96 ( 1.18)
[chacha20 on simulated risc-v]
baseline:
Host time spent: 54762ms
cache aware scheduling enabled:
Host time spent: 28295ms
Time reduced by 48%
Suggested-by: Libo Chen <libchen@purestorage.com>
Suggested-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v3->v4:
Add test results into commit log.
kernel/sched/sched.h | 4 +++-
kernel/sched/topology.c | 18 ++++++++++++++++--
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 511c85572b96..518c798231ac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4035,9 +4035,11 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
#endif /* !CONFIG_SCHED_MM_CID */
#ifdef CONFIG_SCHED_CACHE
+DECLARE_STATIC_KEY_FALSE(sched_cache_present);
+
static inline bool sched_cache_enabled(void)
{
- return false;
+ return static_branch_unlikely(&sched_cache_present);
}
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 995a42cb4697..0b1fc1b0709d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -809,6 +809,7 @@ enum s_alloc {
};
#ifdef CONFIG_SCHED_CACHE
+DEFINE_STATIC_KEY_FALSE(sched_cache_present);
static bool alloc_sd_llc(const struct cpumask *cpu_map,
struct s_data *d)
{
@@ -2674,6 +2675,7 @@ static int
build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
{
enum s_alloc alloc_state = sa_none;
+ bool has_multi_llcs = false;
struct sched_domain *sd;
struct s_data d;
struct rq *rq = NULL;
@@ -2784,10 +2786,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
* between LLCs and memory channels.
*/
nr_llcs = sd->span_weight / child->span_weight;
- if (nr_llcs == 1)
+ if (nr_llcs == 1) {
imb = sd->span_weight >> 3;
- else
+ } else {
imb = nr_llcs;
+ has_multi_llcs = true;
+ }
imb = max(1U, imb);
sd->imb_numa_nr = imb;
@@ -2842,6 +2846,16 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
ret = 0;
error:
+#ifdef CONFIG_SCHED_CACHE
+ /*
+ * TBD: check before writing to it. sched domain rebuild
+ * is not in the critical path, leave as-is for now.
+ */
+ if (!ret && has_multi_llcs)
+ static_branch_enable_cpuslocked(&sched_cache_present);
+ else
+ static_branch_disable_cpuslocked(&sched_cache_present);
+#endif
__free_domain_allocs(&d, alloc_state, cpu_map);
return ret;
--
2.32.0
Thread overview: 23+ messages
2026-04-01 21:52 [Patch v4 00/22] Cache aware scheduling Tim Chen
2026-04-01 21:52 ` [Patch v4 01/22] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
2026-04-01 21:52 ` [Patch v4 02/22] sched/cache: Limit the scan number of CPUs when calculating task occupancy Tim Chen
2026-04-01 21:52 ` [Patch v4 03/22] sched/cache: Record per LLC utilization to guide cache aware scheduling decisions Tim Chen
2026-04-01 21:52 ` [Patch v4 04/22] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
2026-04-01 21:52 ` [Patch v4 05/22] sched/cache: Make LLC id continuous Tim Chen
2026-04-01 21:52 ` [Patch v4 06/22] sched/cache: Assign preferred LLC ID to processes Tim Chen
2026-04-01 21:52 ` [Patch v4 07/22] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
2026-04-01 21:52 ` [Patch v4 08/22] sched/cache: Introduce per CPU's tasks LLC preference counter Tim Chen
2026-04-01 21:52 ` [Patch v4 09/22] sched/cache: Calculate the percpu sd task LLC preference Tim Chen
2026-04-01 21:52 ` [Patch v4 10/22] sched/cache: Count tasks prefering destination LLC in a sched group Tim Chen
2026-04-01 21:52 ` [Patch v4 11/22] sched/cache: Check local_group only once in update_sg_lb_stats() Tim Chen
2026-04-01 21:52 ` [Patch v4 12/22] sched/cache: Prioritize tasks preferring destination LLC during balancing Tim Chen
2026-04-01 21:52 ` [Patch v4 13/22] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
2026-04-01 21:52 ` [Patch v4 14/22] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
2026-04-01 21:52 ` [Patch v4 15/22] sched/cache: Respect LLC preference in task migration and detach Tim Chen
2026-04-01 21:52 ` [Patch v4 16/22] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
2026-04-01 21:52 ` [Patch v4 17/22] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
2026-04-01 21:52 ` Tim Chen [this message]
2026-04-01 21:52 ` [Patch v4 19/22] sched/cache: Allow the user space to turn on and off cache aware scheduling Tim Chen
2026-04-01 21:52 ` [Patch v4 20/22] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling Tim Chen
2026-04-01 21:52 ` [Patch v4 21/22] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Tim Chen
2026-04-01 21:52 ` [Patch v4 22/22] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics Tim Chen