* [PATCH v3 00/21] Cache Aware Scheduling
@ 2026-02-10 22:18 Tim Chen
2026-02-10 22:18 ` [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
` (21 more replies)
0 siblings, 22 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data within
the same Last Level Cache (LLC) domain. By improving cache locality,
the scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency. The design builds on the initial
prototype from Peter [1].
This initial implementation treats threads within the same process as
entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate such threads onto the same LLC domain
whenever possible.
Most of the feedback received on v2 has been addressed. There were
discussions around grouping tasks using mechanisms other than process
membership. While we agree that more flexible grouping is desirable, this
series intentionally focuses on establishing the basic process-based
grouping first, with alternative grouping mechanisms to be explored
in a follow-on series. As a step in that direction, cache-aware
scheduling statistics have been separated from the mm structure into a
new sched_cache_stat structure. Thanks for the much useful feedback
at LPC 2025 and on v2; we'd like to start a separate thread to
discuss possible user interfaces.
The load balancing algorithms remain largely unchanged. The main
changes in v3 are:
1. Cache-aware scheduling is skipped after repeated load balance
failures (up to cache_nice_tries). This avoids repeatedly attempting
cache-aware migrations when no movable tasks prefer the destination
LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
the destination LLC. This sorting was costly, and equivalent
behavior can be achieved by skipping tasks that do not prefer the
destination LLC during cache-aware migrations.
3. The calculation of the LLC ID now uses
   sched_domain_topology_level data directly, which simplifies
   the ID derivation.
4. Accounting of the number of tasks preferring each LLC is now kept in
the lowest-level sched domain per CPU. This simplifies handling of
LLC resizing and changes in the number of LLC domains.
Test results:
The patch series was applied and tested on v6.19-rc3.
See: https://github.com/timcchen1298/linux/commits/cache_aware_v3
The first test platform is a 2-socket Intel Sapphire Rapids system with
30 cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.
The second test platform is an AMD Genoa. There are 4 nodes and 32 CPUs
per node. Each node has 2 CCXs and each CCX has 16 CPUs.
hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.
[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
different active threads is below the capacity of an LLC.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan (a RISC-V simulator) shows good throughput
improvement. No obvious difference was observed in
netperf/stream/stress-ng Hmean.
Genoa:
Significant improvement is observed in hackbench when
the number of active threads is lower than the number
of CPUs within 1 LLC. On v2, Aaron reported improvements
in hackbench/redis when the system is underloaded.
ChaCha20-xiangshan shows huge throughput improvement.
Phoronix tested v1 and reported good improvements in 30+
cases [2]. No obvious difference was observed in
netperf/stream/stress-ng Hmean.
Detail:
Due to length constraints, data showing little difference from the
baseline is not presented.
Sapphire Rapids:
[hackbench pipe]
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 3.19) +29.06 ( 3.31)*
threads-pipe-2 2-groups 1.00 ( 9.61) +19.19 ( 0.55)*
threads-pipe-2 4-groups 1.00 ( 6.69) +15.02 ( 1.34)*
threads-pipe-2 8-groups 1.00 ( 1.83) +25.59 ( 1.46)*
threads-pipe-4 1-groups 1.00 ( 3.41) +28.63 ( 1.17)*
threads-pipe-4 2-groups 1.00 ( 15.62) +19.51 ( 0.82)
threads-pipe-4 4-groups 1.00 ( 0.19) +27.05 ( 0.74)*
threads-pipe-4 8-groups 1.00 ( 4.32) +5.64 ( 3.18)
threads-pipe-8 1-groups 1.00 ( 0.44) +24.68 ( 0.49)*
threads-pipe-8 2-groups 1.00 ( 2.03) +23.76 ( 0.52)*
threads-pipe-8 4-groups 1.00 ( 3.77) +7.16 ( 1.58)
threads-pipe-8 8-groups 1.00 ( 4.53) +6.88 ( 2.36)
threads-pipe-16 1-groups 1.00 ( 1.71) +28.46 ( 0.68)*
threads-pipe-16 2-groups 1.00 ( 4.25) -0.23 ( 0.97)
threads-pipe-16 4-groups 1.00 ( 0.64) -0.95 ( 3.74)
threads-pipe-16 8-groups 1.00 ( 1.23) +1.77 ( 0.31)
Note: The default number of fds in hackbench is changed from 20 to various
values to ensure that threads fit within a single LLC, especially on AMD
systems. Take "threads-pipe-8, 2-groups" for example: the number of fds
is 8, and 2 groups are created.
[schbench]
The 99th percentile wakeup latency shows overall improvements, while
the 99th percentile request latency exhibits some increased run-to-run
variance. The cache-aware scheduling logic, which scans all online CPUs
to identify the hottest LLC, may be the root cause of the elevated
request latency: it delays the task from returning to user space
because task_cache_work() is costly. This issue should be mitigated by
restricting the scan to a limited set of NUMA nodes [3]; that fix is
planned to be integrated once the current version is in good shape.
99th Wakeup Latencies Base (mean±std) Compare (mean±std) Change
--------------------------------------------------------------------------------
thread = 2 13.33(1.15) 13.00(1.73) +2.48%
thread = 4 12.33(1.53) 9.67(1.53) +21.57%
thread = 8 10.00(0.00) 10.67(0.58) -6.70%
thread = 16 10.00(1.00) 9.33(0.58) +6.70%
thread = 32 10.33(0.58) 9.67(1.53) +6.39%
thread = 64 10.33(0.58) 9.33(1.53) +9.68%
thread = 128 12.67(0.58) 12.00(0.00) +5.29%
Run-to-run variance regression at 1 message thread + 8 worker threads:
Request Latencies 99.0th 3981.33(260.16) 4877.33(1880.57) -22.51%
[chacha20]
Time reduced by 20%
Genoa:
[hackbench pipe]
The default number of fds is 20, which exceeds the number of CPUs
in an LLC, so the number of fds is adjusted to 2, 4, 8, 16 respectively.
Excluding results with large run-to-run variance, a 20% ~ 50%
improvement is observed when the system is underloaded:
case load baseline(std%) compare%( std%)
threads-pipe-2 1-groups 1.00 ( 4.04) +47.22 ( 4.77)*
threads-pipe-2 2-groups 1.00 ( 5.04) +33.79 ( 8.92)*
threads-pipe-2 4-groups 1.00 ( 5.82) +5.93 ( 7.97)
threads-pipe-2 8-groups 1.00 ( 16.15) -4.11 ( 6.85)
threads-pipe-4 1-groups 1.00 ( 7.28) +50.43 ( 2.39)*
threads-pipe-4 2-groups 1.00 ( 10.77) -4.31 ( 7.71)
threads-pipe-4 4-groups 1.00 ( 11.16) +8.12 ( 11.21)
threads-pipe-4 8-groups 1.00 ( 12.79) -10.10 ( 12.92)
threads-pipe-8 1-groups 1.00 ( 5.57) -1.50 ( 6.55)
threads-pipe-8 2-groups 1.00 ( 10.72) +0.69 ( 6.38)
threads-pipe-8 4-groups 1.00 ( 7.04) +19.70 ( 5.58)*
threads-pipe-8 8-groups 1.00 ( 7.11) +27.46 ( 2.34)*
threads-pipe-16 1-groups 1.00 ( 2.86) -12.82 ( 8.97)
threads-pipe-16 2-groups 1.00 ( 8.55) +2.96 ( 1.65)
threads-pipe-16 4-groups 1.00 ( 5.12) +20.49 ( 5.33)*
threads-pipe-16 8-groups 1.00 ( 3.23) +9.06 ( 2.87)
[chacha20]
baseline:
Host time spent: 51432ms
sched_cache:
Host time spent: 28664ms
Time reduced by 45%
[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin
[3] https://lore.kernel.org/all/865b852e3fdef6561c9e0a5be9a94aec8a68cdea.1760206683.git.tim.c.chen@linux.intel.com/
Change history:
**v3 Changes:**
1. Cache-aware scheduling is skipped after repeated load balance
failures (up to cache_nice_tries). This avoids repeatedly attempting
cache-aware migrations when no movable tasks prefer the destination
LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
the destination LLC. This sorting was costly, and equivalent
behavior can be achieved by skipping tasks that do not prefer the
destination LLC during cache-aware migrations.
3. Accounting of the number of tasks preferring each LLC is now kept in
the lowest-level sched domain per CPU. This simplifies handling of
LLC resizing and changes in the number of LLC domains.
4. Other changes from v2 are detailed in each patch's change log.
**v2 Changes:**
v2 link: https://lore.kernel.org/all/cover.1764801860.git.tim.c.chen@linux.intel.com/
1. Align NUMA balancing and cache affinity by
prioritizing NUMA balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
directly as array indices for LLC statistics.
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedback from the review of the v1 patch set
   (see individual patch change logs).
**v1**
v1 link: https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
Chen Yu (10):
sched/cache: Record per LLC utilization to guide cache aware
scheduling decisions
sched/cache: Introduce helper functions to enforce LLC migration
policy
sched/cache: Make LLC id continuous
sched/cache: Disable cache aware scheduling for processes with high
thread counts
sched/cache: Avoid cache-aware scheduling for memory-heavy processes
sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
sched/cache: Allow the user space to turn on and off cache aware
scheduling
sched/cache: Add user control to adjust the aggressiveness of
cache-aware scheduling
-- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy
for each process via proc fs
-- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load
balance statistics
Peter Zijlstra (Intel) (1):
sched/cache: Introduce infrastructure for cache-aware load balancing
Tim Chen (10):
sched/cache: Assign preferred LLC ID to processes
sched/cache: Track LLC-preferred tasks per runqueue
sched/cache: Introduce per CPU's tasks LLC preference counter
sched/cache: Calculate the percpu sd task LLC preference
sched/cache: Count tasks prefering destination LLC in a sched group
sched/cache: Check local_group only once in update_sg_lb_stats()
sched/cache: Prioritize tasks preferring destination LLC during
balancing
sched/cache: Add migrate_llc_task migration type for cache-aware
balancing
sched/cache: Handle moving single tasks to/from their preferred LLC
sched/cache: Respect LLC preference in task migration and detach
fs/proc/base.c | 31 +
include/linux/cacheinfo.h | 21 +-
include/linux/mm_types.h | 43 ++
include/linux/sched.h | 32 +
include/linux/sched/topology.h | 8 +
include/trace/events/sched.h | 79 +++
init/Kconfig | 11 +
init/init_task.c | 3 +
kernel/fork.c | 6 +
kernel/sched/core.c | 11 +
kernel/sched/debug.c | 55 ++
kernel/sched/fair.c | 1088 +++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 44 ++
kernel/sched/topology.c | 194 +++++-
14 files changed, 1598 insertions(+), 28 deletions(-)
--
2.32.0
^ permalink raw reply [flat|nested] 117+ messages in thread
* [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-14 12:26 ` Madadi Vineeth Reddy
2026-02-10 22:18 ` [PATCH v3 02/21] sched/cache: Record per LLC utilization to guide cache aware scheduling decisions Tim Chen
` (20 subsequent siblings)
21 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Add infrastructure to enable cache-aware load balancing, which
improves cache locality by grouping tasks that share resources
within the same cache domain. This reduces cache misses and improves
overall data access efficiency.
In this initial implementation, threads belonging to the same process
are treated as entities that likely share working sets. The mechanism
tracks per-process CPU occupancy across cache domains and attempts to
migrate threads toward cache-hot domains where their process already
has active threads, thereby enhancing locality.
This provides a basic model for cache affinity. While the current code
targets the last-level cache (LLC), the approach could be extended to
other domain types such as clusters (L2) or node-internal groupings.
At present, the mechanism selects the CPU within an LLC that has the
highest recent runtime. Subsequent patches in this series will use this
information in the load-balancing path to guide task placement toward
preferred LLCs.
In the future, more advanced policies could be integrated through NUMA
balancing: for example, migrating a task to its preferred LLC when spare
capacity exists, or swapping tasks across LLCs to improve cache affinity.
Task grouping could also be generalized from a process to a NUMA group,
or made user configurable.
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
Notes:
v2->v3:
Fix epoch wraparound in the time comparison of mm->mm_sched_epoch.
(Peter Zijlstra)
Remove __no_profile tag. (Peter Zijlstra)
Introduce a new structure named sched_cache_stat
to save the statistics of cache aware scheduling, similar
to mm_mm_cid. (Peter Zijlstra)
include/linux/mm_types.h | 32 +++++
include/linux/sched.h | 24 ++++
init/Kconfig | 11 ++
kernel/fork.c | 6 +
kernel/sched/core.c | 6 +
kernel/sched/fair.c | 265 +++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 14 +++
7 files changed, 358 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 42af2292951d..777a48523aa6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1125,6 +1125,8 @@ struct mm_struct {
/* MM CID related storage */
struct mm_mm_cid mm_cid;
+ /* sched_cache related statistics */
+ struct sched_cache_stat sc_stat;
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* size of all page tables */
#endif
@@ -1519,6 +1521,36 @@ static inline unsigned int mm_cid_size(void)
}
#endif /* CONFIG_SCHED_MM_CID */
+#ifdef CONFIG_SCHED_CACHE
+void mm_init_sched(struct mm_struct *mm,
+ struct sched_cache_time __percpu *pcpu_sched);
+
+static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
+{
+ struct sched_cache_time __percpu *pcpu_sched =
+ alloc_percpu_noprof(struct sched_cache_time);
+
+ if (!pcpu_sched)
+ return -ENOMEM;
+
+ mm_init_sched(mm, pcpu_sched);
+ return 0;
+}
+
+#define mm_alloc_sched(...) alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
+
+static inline void mm_destroy_sched(struct mm_struct *mm)
+{
+ free_percpu(mm->sc_stat.pcpu_sched);
+ mm->sc_stat.pcpu_sched = NULL;
+}
+#else /* !CONFIG_SCHED_CACHE */
+
+static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_sched(struct mm_struct *mm) { }
+
+#endif /* CONFIG_SCHED_CACHE */
+
struct mmu_gather;
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..2817a21ee055 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1409,6 +1409,10 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_SCHED_CACHE
+ struct callback_head cache_work;
+#endif
+
struct rseq_data rseq;
struct sched_mm_cid mm_cid;
@@ -2330,6 +2334,26 @@ static __always_inline int task_mm_cid(struct task_struct *t)
}
#endif
+#ifdef CONFIG_SCHED_CACHE
+
+struct sched_cache_time {
+ u64 runtime;
+ unsigned long epoch;
+};
+
+struct sched_cache_stat {
+ struct sched_cache_time __percpu *pcpu_sched;
+ raw_spinlock_t lock;
+ unsigned long epoch;
+ int cpu;
+} ____cacheline_aligned_in_smp;
+
+#else
+
+struct sched_cache_stat { };
+
+#endif
+
#ifndef MODULE
#ifndef COMPILE_OFFSETS
diff --git a/init/Kconfig b/init/Kconfig
index fa79feb8fe57..f4b2649f8401 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -990,6 +990,17 @@ config NUMA_BALANCING
This system will be inactive on UMA systems.
+config SCHED_CACHE
+ bool "Cache aware load balance"
+ default y
+ depends on SMP
+ help
+ When enabled, the scheduler will attempt to aggregate tasks from
+ the same process onto a single Last Level Cache (LLC) domain when
+ possible. This improves cache locality by keeping tasks that share
+ resources within the same cache domain, reducing cache misses and
+ lowering data access latency.
+
config NUMA_BALANCING_DEFAULT_ENABLED
bool "Automatically enable NUMA aware memory/task placement"
default y
diff --git a/kernel/fork.c b/kernel/fork.c
index b1f3915d5f8e..2a49c49f29f9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -723,6 +723,7 @@ void __mmdrop(struct mm_struct *mm)
cleanup_lazy_tlbs(mm);
WARN_ON_ONCE(mm == current->active_mm);
+ mm_destroy_sched(mm);
mm_free_pgd(mm);
mm_free_id(mm);
destroy_context(mm);
@@ -1123,6 +1124,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
if (mm_alloc_cid(mm, p))
goto fail_cid;
+ if (mm_alloc_sched(mm))
+ goto fail_sched;
+
if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
NR_MM_COUNTERS))
goto fail_pcpu;
@@ -1132,6 +1136,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
return mm;
fail_pcpu:
+ mm_destroy_sched(mm);
+fail_sched:
mm_destroy_cid(mm);
fail_cid:
destroy_context(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be16911..c6efa71cf500 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4412,6 +4412,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
init_numa_balancing(clone_flags, p);
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
+ init_sched_mm(p);
}
DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -8691,6 +8692,11 @@ void __init sched_init(void)
rq->core_cookie = 0UL;
#endif
+#ifdef CONFIG_SCHED_CACHE
+ raw_spin_lock_init(&rq->cpu_epoch_lock);
+ rq->cpu_epoch_next = jiffies;
+#endif
+
zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..58286275e166 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1136,6 +1136,8 @@ void post_init_entity_util_avg(struct task_struct *p)
sa->runnable_avg = sa->util_avg;
}
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec);
+
static s64 update_se(struct rq *rq, struct sched_entity *se)
{
u64 now = rq_clock_task(rq);
@@ -1158,6 +1160,7 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
trace_sched_stat_runtime(running, delta_exec);
account_group_exec_runtime(running, delta_exec);
+ account_mm_sched(rq, running, delta_exec);
/* cgroup time is always accounted against the donor */
cgroup_account_cputime(donor, delta_exec);
@@ -1179,6 +1182,266 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
static void set_next_buddy(struct sched_entity *se);
+#ifdef CONFIG_SCHED_CACHE
+
+/*
+ * XXX numbers come from a place the sun don't shine -- probably wants to be SD
+ * tunable or so.
+ */
+#define EPOCH_PERIOD (HZ / 100) /* 10 ms */
+#define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */
+
+static int llc_id(int cpu)
+{
+ if (cpu < 0)
+ return -1;
+
+ return per_cpu(sd_llc_id, cpu);
+}
+
+void mm_init_sched(struct mm_struct *mm,
+ struct sched_cache_time __percpu *_pcpu_sched)
+{
+ unsigned long epoch;
+ int i;
+
+ for_each_possible_cpu(i) {
+ struct sched_cache_time *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+ struct rq *rq = cpu_rq(i);
+
+ pcpu_sched->runtime = 0;
+ pcpu_sched->epoch = rq->cpu_epoch;
+ epoch = rq->cpu_epoch;
+ }
+
+ raw_spin_lock_init(&mm->sc_stat.lock);
+ mm->sc_stat.epoch = epoch;
+ mm->sc_stat.cpu = -1;
+
+ /*
+ * The update to mm->sc_stat should not be reordered
+ * before initialization to mm's other fields, in case
+ * the readers may get invalid mm_sched_epoch, etc.
+ */
+ smp_store_release(&mm->sc_stat.pcpu_sched, _pcpu_sched);
+}
+
+/* because why would C be fully specified */
+static __always_inline void __shr_u64(u64 *val, unsigned int n)
+{
+ if (n >= 64) {
+ *val = 0;
+ return;
+ }
+ *val >>= n;
+}
+
+static inline void __update_mm_sched(struct rq *rq,
+ struct sched_cache_time *pcpu_sched)
+{
+ lockdep_assert_held(&rq->cpu_epoch_lock);
+
+ unsigned long n, now = jiffies;
+ long delta = now - rq->cpu_epoch_next;
+
+ if (delta > 0) {
+ n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+ rq->cpu_epoch += n;
+ rq->cpu_epoch_next += n * EPOCH_PERIOD;
+ __shr_u64(&rq->cpu_runtime, n);
+ }
+
+ n = rq->cpu_epoch - pcpu_sched->epoch;
+ if (n) {
+ pcpu_sched->epoch += n;
+ __shr_u64(&pcpu_sched->runtime, n);
+ }
+}
+
+static unsigned long fraction_mm_sched(struct rq *rq,
+ struct sched_cache_time *pcpu_sched)
+{
+ guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+
+ __update_mm_sched(rq, pcpu_sched);
+
+ /*
+ * Runtime is a geometric series (r=0.5) and as such will sum to twice
+ * the accumulation period; this means the multiplication here should
+ * not overflow.
+ */
+ return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
+}
+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+ struct sched_cache_time *pcpu_sched;
+ struct mm_struct *mm = p->mm;
+ unsigned long epoch;
+
+ if (!sched_cache_enabled())
+ return;
+
+ if (p->sched_class != &fair_sched_class)
+ return;
+ /*
+ * init_task, kthreads and user thread created
+ * by user_mode_thread() don't have mm.
+ */
+ if (!mm || !mm->sc_stat.pcpu_sched)
+ return;
+
+ pcpu_sched = per_cpu_ptr(p->mm->sc_stat.pcpu_sched, cpu_of(rq));
+
+ scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+ __update_mm_sched(rq, pcpu_sched);
+ pcpu_sched->runtime += delta_exec;
+ rq->cpu_runtime += delta_exec;
+ epoch = rq->cpu_epoch;
+ }
+
+ /*
+ * If this process hasn't hit task_cache_work() for a while, or it
+ * has only 1 thread, invalidate its preferred state.
+ */
+ if (time_after(epoch,
+ READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
+ get_nr_threads(p) <= 1) {
+ if (mm->sc_stat.cpu != -1)
+ mm->sc_stat.cpu = -1;
+ }
+}
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p)
+{
+ struct callback_head *work = &p->cache_work;
+ struct mm_struct *mm = p->mm;
+ unsigned long epoch;
+
+ if (!sched_cache_enabled())
+ return;
+
+ if (!mm || !mm->sc_stat.pcpu_sched)
+ return;
+
+ epoch = rq->cpu_epoch;
+ /* avoid moving backwards */
+ if (time_after_eq(mm->sc_stat.epoch, epoch))
+ return;
+
+ guard(raw_spinlock)(&mm->sc_stat.lock);
+
+ if (work->next == work) {
+ task_work_add(p, work, TWA_RESUME);
+ WRITE_ONCE(mm->sc_stat.epoch, epoch);
+ }
+}
+
+static void task_cache_work(struct callback_head *work)
+{
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+ unsigned long m_a_occ = 0;
+ unsigned long curr_m_a_occ = 0;
+ int cpu, m_a_cpu = -1;
+ cpumask_var_t cpus;
+
+ WARN_ON_ONCE(work != &p->cache_work);
+
+ work->next = work;
+
+ if (p->flags & PF_EXITING)
+ return;
+
+ if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
+ return;
+
+ scoped_guard (cpus_read_lock) {
+ cpumask_copy(cpus, cpu_online_mask);
+
+ for_each_cpu(cpu, cpus) {
+ /* XXX sched_cluster_active */
+ struct sched_domain *sd = per_cpu(sd_llc, cpu);
+ unsigned long occ, m_occ = 0, a_occ = 0;
+ int m_cpu = -1, i;
+
+ if (!sd)
+ continue;
+
+ for_each_cpu(i, sched_domain_span(sd)) {
+ occ = fraction_mm_sched(cpu_rq(i),
+ per_cpu_ptr(mm->sc_stat.pcpu_sched, i));
+ a_occ += occ;
+ if (occ > m_occ) {
+ m_occ = occ;
+ m_cpu = i;
+ }
+ }
+
+ /*
+ * Compare the accumulated occupancy of each LLC. The
+ * reason for using accumulated occupancy rather than average
+ * per CPU occupancy is that it works better in asymmetric LLC
+ * scenarios.
+ * For example, if there are 2 threads in a 4CPU LLC and 3
+ * threads in an 8CPU LLC, it might be better to choose the one
+ * with 3 threads. However, this would not be the case if the
+ * occupancy is divided by the number of CPUs in an LLC (i.e.,
+ * if average per CPU occupancy is used).
+ * Besides, NUMA balancing fault statistics behave similarly:
+ * the total number of faults per node is compared rather than
+ * the average number of faults per CPU. This strategy is also
+ * followed here.
+ */
+ if (a_occ > m_a_occ) {
+ m_a_occ = a_occ;
+ m_a_cpu = m_cpu;
+ }
+
+ if (llc_id(cpu) == llc_id(mm->sc_stat.cpu))
+ curr_m_a_occ = a_occ;
+
+ cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+ }
+ }
+
+ if (m_a_occ > (2 * curr_m_a_occ)) {
+ /*
+ * Avoid switching sc_stat.cpu too fast.
+ * The reason to choose 2X is because:
+ * 1. It is better to keep the preferred LLC stable,
+ * rather than changing it frequently and cause migrations
+ * 2. 2X means the new preferred LLC has at least 1 more
+ * busy CPU than the old one (e.g., 200% vs 100%)
+ * 3. 2X is chosen based on test results, as it delivers
+ * the optimal performance gain so far.
+ */
+ mm->sc_stat.cpu = m_a_cpu;
+ }
+
+ free_cpumask_var(cpus);
+}
+
+void init_sched_mm(struct task_struct *p)
+{
+ struct callback_head *work = &p->cache_work;
+
+ init_task_work(work, task_cache_work);
+ work->next = work;
+}
+
+#else
+
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
+ s64 delta_exec) { }
+
+void init_sched_mm(struct task_struct *p) { }
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
+
+#endif
+
/*
* Used by other classes to account runtime.
*/
@@ -13377,6 +13640,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+ task_tick_cache(rq, curr);
+
update_misfit_status(curr, rq);
check_update_overutilized_status(task_rq(curr));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..de5b701c3950 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1196,6 +1196,12 @@ struct rq {
u64 clock_pelt_idle_copy;
u64 clock_idle_copy;
#endif
+#ifdef CONFIG_SCHED_CACHE
+ raw_spinlock_t cpu_epoch_lock ____cacheline_aligned;
+ u64 cpu_runtime;
+ unsigned long cpu_epoch;
+ unsigned long cpu_epoch_next;
+#endif
atomic_t nr_iowait;
@@ -3890,6 +3896,14 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct *next) { }
#endif /* !CONFIG_SCHED_MM_CID */
+#ifdef CONFIG_SCHED_CACHE
+static inline bool sched_cache_enabled(void)
+{
+ return false;
+}
+#endif
+extern void init_sched_mm(struct task_struct *p);
+
extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
static inline
--
2.32.0
* [PATCH v3 02/21] sched/cache: Record per LLC utilization to guide cache aware scheduling decisions
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
2026-02-10 22:18 ` [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-10 22:18 ` [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
` (19 subsequent siblings)
21 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
When a system becomes busy and a process's preferred LLC is
saturated with too many threads, tasks within that LLC migrate
frequently. These in-LLC migrations introduce latency and degrade
performance. To avoid this, task aggregation should be suppressed
when the preferred LLC is overloaded, which requires a metric that
indicates LLC utilization.
Record per-LLC utilization/CPU capacity during periodic load
balancing. These statistics will be used in later patches to decide
whether tasks should be aggregated into their preferred LLC.
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
Notes:
v2->v3:
Remove ____cacheline_aligned_in_smp attribute in
struct sched_domain_shared to avoid premature optimization.
(Peter Zijlstra)
include/linux/sched/topology.h | 4 ++
kernel/sched/fair.c | 70 ++++++++++++++++++++++++++++++++++
2 files changed, 74 insertions(+)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 45c0022b91ce..a4e2fb31f2fd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -68,6 +68,10 @@ struct sched_domain_shared {
atomic_t nr_busy_cpus;
int has_idle_cores;
int nr_idle_scan;
+#ifdef CONFIG_SCHED_CACHE
+ unsigned long util_avg;
+ unsigned long capacity;
+#endif
};
struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58286275e166..dfeb107f2cfd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9688,6 +9688,29 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
return 0;
}
+#ifdef CONFIG_SCHED_CACHE
+/* Called from load balancing paths with rcu_read_lock held */
+static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
+ unsigned long *cap)
+{
+ struct sched_domain_shared *sd_share;
+
+ sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+ if (!sd_share)
+ return false;
+
+ *util = READ_ONCE(sd_share->util_avg);
+ *cap = READ_ONCE(sd_share->capacity);
+
+ return true;
+}
+#else
+static inline bool get_llc_stats(int cpu, unsigned long *util,
+ unsigned long *cap)
+{
+ return false;
+}
+#endif
/*
* can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
*/
@@ -10658,6 +10681,52 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
return check_cpu_capacity(rq, sd);
}
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Record the statistics for this scheduler group for later
+ * use. These values guide load balancing on aggregating tasks
+ * to a LLC.
+ */
+static void record_sg_llc_stats(struct lb_env *env,
+ struct sg_lb_stats *sgs,
+ struct sched_group *group)
+{
+ struct sched_domain_shared *sd_share;
+
+ if (!sched_cache_enabled() || env->idle == CPU_NEWLY_IDLE)
+ return;
+
+ /* Only care about sched domain spanning multiple LLCs */
+ if (env->sd->child != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
+ return;
+
+ /*
+ * At this point we know this group spans a LLC domain.
+ * Record the statistic of this group in its corresponding
+ * shared LLC domain.
+ * Note: sd_share cannot be obtained via sd->child->shared,
+ * because the latter refers to the domain that covers the
+ * local group. Instead, sd_share should be located using
+ * the first CPU of the LLC group.
+ */
+ sd_share = rcu_dereference(per_cpu(sd_llc_shared,
+ cpumask_first(sched_group_span(group))));
+ if (!sd_share)
+ return;
+
+ if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
+ WRITE_ONCE(sd_share->util_avg, sgs->group_util);
+
+ if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
+ WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
+}
+#else
+static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
+ struct sched_group *group)
+{
+}
+#endif
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -10747,6 +10816,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
+ record_sg_llc_stats(env, sgs, group);
/* Computing avg_load makes sense only when group is overloaded */
if (sgs->group_type == group_overloaded)
sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
--
2.32.0
* [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
2026-02-10 22:18 ` [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
2026-02-10 22:18 ` [PATCH v3 02/21] sched/cache: Record per LLC utilization to guide cache aware scheduling decisions Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-14 16:12 ` Madadi Vineeth Reddy
2026-02-19 11:29 ` Peter Zijlstra
2026-02-10 22:18 ` [PATCH v3 04/21] sched/cache: Make LLC id continuous Tim Chen
` (18 subsequent siblings)
21 siblings, 2 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Cache-aware scheduling aggregates threads onto their preferred LLC,
mainly through load balancing. When the preferred LLC becomes
saturated, more threads are still placed there, increasing latency.
A mechanism is needed to limit aggregation so that the preferred LLC
does not become overloaded.
Introduce helper functions can_migrate_llc() and
can_migrate_llc_task() to enforce the LLC migration policy:
1. Aggregate a task onto its preferred LLC if both the source
and destination LLCs are not too busy, or if doing so will
not leave the preferred LLC much more imbalanced than the
non-preferred one (>20% utilization difference, slightly
higher than the LLC domain's imbalance_pct of 17%, to
provide hysteresis).
2. Allow moving a task from an overloaded preferred LLC to a
non-preferred LLC if this will not leave the non-preferred
LLC so imbalanced that a later migration back is triggered.
3. If both LLCs are too busy, let generic load balancing
spread the tasks.
Further hysteresis could be added in the future to prevent tasks
from bouncing into and out of the preferred LLC: the threshold for
migrating a task out of its preferred LLC should be higher than
that for migrating it into the LLC.
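The three rules above can be modeled as a small standalone C sketch (not
kernel code: names, thresholds, and capacity values here are illustrative
stand-ins for fits_llc_capacity(), util_greater(), and can_migrate_llc()
from the patch):

```c
/* Standalone model of the LLC migration policy (illustrative only). */
#include <stdbool.h>

enum llc_mig { MIG_FORBID, MIG_LLC, MIG_UNRESTRICTED };

/* An LLC counts as "not too busy" below ~50% of its capacity. */
static bool fits_cap(unsigned long util, unsigned long cap)
{
	return util * 2 < cap;
}

/* Is u1 more than 20% greater than u2? (imbalance hysteresis) */
static bool util_greater(unsigned long u1, unsigned long u2)
{
	return u1 * 100 > u2 * 120;
}

static enum llc_mig can_migrate(unsigned long src_util, unsigned long src_cap,
				unsigned long dst_util, unsigned long dst_cap,
				unsigned long tsk_util, bool to_pref)
{
	/* Rule 3: both LLCs busy -> defer to generic load balancing. */
	if (!fits_cap(src_util, src_cap) && !fits_cap(dst_util, dst_cap))
		return MIG_UNRESTRICTED;

	/* Project utilization after the hypothetical migration. */
	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
	dst_util += tsk_util;

	if (to_pref) {
		/* Rule 1: don't overload the preferred LLC. */
		if (!fits_cap(dst_util, dst_cap) &&
		    util_greater(dst_util, src_util))
			return MIG_FORBID;
	} else {
		/* Rule 2: only leave a preferred LLC that is overloaded,
		 * and only if that won't trigger a migration back. */
		if (fits_cap(src_util, src_cap) ||
		    !util_greater(src_util, dst_util))
			return MIG_FORBID;
	}
	return MIG_LLC;
}
```

With a capacity of 1024, moving a 50-unit task toward a lightly loaded
preferred LLC yields MIG_LLC, two saturated LLCs yield MIG_UNRESTRICTED,
and pulling a task off an idle preferred LLC yields MIG_FORBID.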
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
Notes:
v2->v3:
No change.
kernel/sched/fair.c | 153 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 153 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dfeb107f2cfd..bf5f39a01017 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9689,6 +9689,27 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
}
#ifdef CONFIG_SCHED_CACHE
+/*
+ * The margin used when comparing LLC utilization with CPU capacity.
+ * It determines the LLC load level where active LLC aggregation is
+ * done.
+ * Derived from fits_capacity().
+ *
+ * (default: ~50%)
+ */
+#define fits_llc_capacity(util, max) \
+ ((util) * 2 < (max))
+
+/*
+ * The margin used when comparing utilization:
+ * is 'util1' noticeably greater than 'util2'?
+ * Derived from capacity_greater(); bias is in percent.
+ * This allows the dst utilization to exceed the src
+ * utilization by up to the bias percent (20%).
+ */
+#define util_greater(util1, util2) \
+ ((util1) * 100 > (util2) * 120)
+
/* Called from load balancing paths with rcu_read_lock held */
static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
unsigned long *cap)
@@ -9704,6 +9725,138 @@ static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
return true;
}
+
+/*
+ * Decision matrix based on LLC utilization, used to
+ * decide whether tasks can be aggregated across LLCs.
+ *
+ * By default, 50% is the threshold for treating the LLC
+ * as busy. The reason for choosing 50% is to avoid saturation
+ * of SMT-2, and it is also a safe cutoff for other SMT-n
+ * platforms.
+ *
+ * 20% is the utilization imbalance threshold used to decide
+ * if the preferred LLC is busier than the non-preferred LLC.
+ * 20 is slightly higher than the LLC domain's imbalance_pct
+ * of 17. The hysteresis is used to avoid task bouncing between
+ * the preferred LLC and the non-preferred LLC.
+ *
+ * 1. moving towards the preferred LLC, dst is the preferred
+ * LLC, src is not.
+ *
+ * src \ dst 30% 40% 50% 60%
+ * 30% Y Y Y N
+ * 40% Y Y Y Y
+ * 50% Y Y G G
+ * 60% Y Y G G
+ *
+ * 2. moving out of the preferred LLC, src is the preferred
+ * LLC, dst is not:
+ *
+ * src \ dst 30% 40% 50% 60%
+ * 30% N N N N
+ * 40% N N N N
+ * 50% N N G G
+ * 60% Y N G G
+ *
+ * src : src_util
+ * dst : dst_util
+ * Y : Yes, migrate
+ * N : No, do not migrate
+ * G : let the Generic load balance to even the load.
+ *
+ * The intention is that if both LLCs are quite busy, cache aware
+ * load balance should not be performed, and generic load balance
+ * should take effect. However, if one is busy and the other is not,
+ * the preferred LLC capacity (50%) and imbalance criteria (20%) should
+ * be considered to determine whether LLC aggregation should be
+ * performed to bias the load towards the preferred LLC.
+ */
+
+/* Migration decision; the 3 states are mutually exclusive. */
+enum llc_mig {
+ mig_forbid = 0, /* N: Don't migrate task, respect LLC preference */
+ mig_llc, /* Y: Do LLC preference based migration */
+ mig_unrestricted /* G: Don't restrict generic load balance migration */
+};
+
+/*
+ * Check if a task can be moved from the source LLC to the
+ * destination LLC without breaking cache-aware preference.
+ * src_cpu and dst_cpu are arbitrary CPUs within the source
+ * and destination LLCs, respectively.
+ */
+static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
+ unsigned long tsk_util,
+ bool to_pref)
+{
+ unsigned long src_util, dst_util, src_cap, dst_cap;
+
+ if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
+ !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
+ return mig_unrestricted;
+
+ if (!fits_llc_capacity(dst_util, dst_cap) &&
+ !fits_llc_capacity(src_util, src_cap))
+ return mig_unrestricted;
+
+ src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
+ dst_util = dst_util + tsk_util;
+ if (to_pref) {
+ /*
+ * Don't migrate if doing so would leave the
+ * preferred LLC too heavily loaded and the dst
+ * much busier than the src, in which case the
+ * migration would increase the imbalance too much.
+ */
+ if (!fits_llc_capacity(dst_util, dst_cap) &&
+ util_greater(dst_util, src_util))
+ return mig_forbid;
+ } else {
+ /*
+ * Don't migrate if we would leave the preferred
+ * LLC too idle, or if this migration would leave
+ * the non-preferred LLC within sysctl_aggr_imb
+ * percent of the preferred LLC, triggering a
+ * migration back to the preferred LLC.
+ */
+ if (fits_llc_capacity(src_util, src_cap) ||
+ !util_greater(src_util, dst_util))
+ return mig_forbid;
+ }
+ return mig_llc;
+}
+
+/*
+ * Check if task p can migrate from source LLC to
+ * destination LLC in terms of cache aware load balance.
+ */
+static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+ struct task_struct *p)
+{
+ struct mm_struct *mm;
+ bool to_pref;
+ int cpu;
+
+ mm = p->mm;
+ if (!mm)
+ return mig_unrestricted;
+
+ cpu = mm->sc_stat.cpu;
+ if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
+ return mig_unrestricted;
+
+ if (cpus_share_cache(dst_cpu, cpu))
+ to_pref = true;
+ else if (cpus_share_cache(src_cpu, cpu))
+ to_pref = false;
+ else
+ return mig_unrestricted;
+
+ return can_migrate_llc(src_cpu, dst_cpu,
+ task_util(p), to_pref);
+}
+
#else
static inline bool get_llc_stats(int cpu, unsigned long *util,
unsigned long *cap)
--
2.32.0
* [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (2 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-14 17:53 ` Madadi Vineeth Reddy
` (3 more replies)
2026-02-10 22:18 ` [PATCH v3 05/21] sched/cache: Assign preferred LLC ID to processes Tim Chen
` (17 subsequent siblings)
21 siblings, 4 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Introduce an index mapping between CPUs and their LLCs. This provides
the continuous per-LLC index needed by cache-aware load balancing in
later patches.
The existing per-CPU sd_llc_id usually holds the id of the first CPU
in the LLC domain, which is sparse and unsuitable as an array index;
using it directly would waste memory.
With the new mapping, CPUs in the same LLC share a continuous id:
per_cpu(llc_id, CPU=0...15) = 0
per_cpu(llc_id, CPU=16...31) = 1
per_cpu(llc_id, CPU=32...47) = 2
...
Once a CPU has been assigned an llc_id, this ID persists even when
the CPU is taken offline and brought back online, which simplifies
management of the ID.
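The id assignment described above can be sketched as a standalone C model
(not kernel code: the llc_first_cpu[] table is a hypothetical stand-in for
the sparse first-CPU-of-LLC id, and the CPU count is arbitrary):

```c
/* Toy model of continuous LLC id assignment (illustrative only). */
#define NR_CPUS 8

/* Returns the number of LLCs discovered; fills llc_id[] with
 * continuous ids, reusing a sibling's id when CPUs share an LLC. */
static int assign_llc_ids(const int llc_first_cpu[NR_CPUS],
			  int llc_id[NR_CPUS])
{
	int max_llcs = 0;

	for (int i = 0; i < NR_CPUS; i++)
		llc_id[i] = -1;

	for (int i = 0; i < NR_CPUS; i++) {
		if (llc_id[i] != -1)
			continue;	/* already assigned; id persists */

		/* try to reuse the llc_id of a sibling in the same LLC */
		for (int j = 0; j < i; j++) {
			if (llc_first_cpu[j] == llc_first_cpu[i]) {
				llc_id[i] = llc_id[j];
				break;
			}
		}

		/* a new LLC is detected */
		if (llc_id[i] == -1)
			llc_id[i] = max_llcs++;
	}
	return max_llcs;
}
```

For example, two LLCs of four CPUs each (first CPUs 0 and 4) map to the
continuous ids 0 and 1 rather than the sparse ids 0 and 4.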
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
Notes:
v2->v3:
Allocate the LLC id according to the topology level data directly, rather
than calculating from the sched domain. This simplifies the code.
(Peter Zijlstra, K Prateek Nayak)
kernel/sched/topology.c | 47 ++++++++++++++++++++++++++++++++++++++---
1 file changed, 44 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..ca46b5cf7f78 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
/* Protected by sched_domains_mutex: */
static cpumask_var_t sched_domains_tmpmask;
static cpumask_var_t sched_domains_tmpmask2;
+static int tl_max_llcs;
static int __init sched_debug_setup(char *str)
{
@@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
*/
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
-DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_llc_id) = -1;
DEFINE_PER_CPU(int, sd_share_id);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
@@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_size, cpu) = size;
- per_cpu(sd_llc_id, cpu) = id;
rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
sd = lowest_flag_domain(cpu, SD_CLUSTER);
@@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Set up domains for CPUs specified by the cpu_map: */
for_each_cpu(i, cpu_map) {
- struct sched_domain_topology_level *tl;
+ struct sched_domain_topology_level *tl, *tl_llc = NULL;
+ int lid;
sd = NULL;
for_each_sd_topology(tl) {
+ int flags = 0;
+
+ if (tl->sd_flags)
+ flags = (*tl->sd_flags)();
+
+ if (flags & SD_SHARE_LLC)
+ tl_llc = tl;
sd = build_sched_domain(tl, cpu_map, attr, sd, i);
@@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (cpumask_equal(cpu_map, sched_domain_span(sd)))
break;
}
+
+ lid = per_cpu(sd_llc_id, i);
+ if (lid == -1) {
+ int j;
+
+ /*
+ * Assign the llc_id to the CPUs that do not
+ * have an LLC.
+ */
+ if (!tl_llc) {
+ per_cpu(sd_llc_id, i) = tl_max_llcs++;
+
+ continue;
+ }
+
+ /* try to reuse the llc_id of its siblings */
+ for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
+ if (i == j)
+ continue;
+
+ lid = per_cpu(sd_llc_id, j);
+
+ if (lid != -1) {
+ per_cpu(sd_llc_id, i) = lid;
+
+ break;
+ }
+ }
+
+ /* a new LLC is detected */
+ if (lid == -1)
+ per_cpu(sd_llc_id, i) = tl_max_llcs++;
+ }
}
if (WARN_ON(!topology_span_sane(cpu_map)))
--
2.32.0
^ permalink raw reply related [flat|nested] 117+ messages in thread
* [PATCH v3 05/21] sched/cache: Assign preferred LLC ID to processes
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (3 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 04/21] sched/cache: Make LLC id continuous Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-14 18:36 ` Madadi Vineeth Reddy
2026-02-10 22:18 ` [PATCH v3 06/21] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
` (16 subsequent siblings)
21 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
With cache-aware scheduling enabled, each task is assigned a
preferred LLC ID. This allows quick identification of the LLC domain
where the task prefers to run, similar to numa_preferred_nid in
NUMA balancing.
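The selection rule, including the NUMA-balancing guard described in the
notes below, can be sketched in standalone C (not kernel code: the
cpu_to_llc/cpu_to_node tables and the topology are hypothetical):

```c
/* Toy model of preferred-LLC selection with a NUMA-balancing
 * guard (illustrative only; topology tables are made up). */
#define NR_CPUS 8

static const int cpu_to_llc[NR_CPUS]  = {0, 0, 1, 1, 2, 2, 3, 3};
static const int cpu_to_node[NR_CPUS] = {0, 0, 0, 0, 1, 1, 1, 1};

/* Return the preferred LLC for a task whose mm's hottest CPU is
 * mm_cpu, or -1 when there is no preference or when the LLC would
 * conflict with NUMA balancing's preferred node. */
static int get_pref_llc(int mm_cpu, int numa_preferred_nid)
{
	if (mm_cpu < 0)
		return -1;	/* no hot CPU recorded for this mm */

	if (numa_preferred_nid >= 0 &&
	    cpu_to_node[mm_cpu] != numa_preferred_nid)
		return -1;	/* defer to NUMA balancing */

	return cpu_to_llc[mm_cpu];
}
```

A task hot on CPU 5 (node 1, LLC 2) keeps LLC 2 as its preference unless
NUMA balancing prefers node 0, in which case no LLC preference is set.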
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Add comments around code handling NUMA balance conflict with cache aware
scheduling. (Peter Zijlstra)
Check if NUMA balancing is disabled before checking numa_preferred_nid
(Jianyong Wu)
include/linux/sched.h | 1 +
init/init_task.c | 3 +++
kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 46 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2817a21ee055..c98bd1c46088 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1411,6 +1411,7 @@ struct task_struct {
#ifdef CONFIG_SCHED_CACHE
struct callback_head cache_work;
+ int preferred_llc;
#endif
struct rseq_data rseq;
diff --git a/init/init_task.c b/init/init_task.c
index 49b13d7c3985..baa420de2644 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -218,6 +218,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.numa_group = NULL,
.numa_faults = NULL,
#endif
+#ifdef CONFIG_SCHED_CACHE
+ .preferred_llc = -1,
+#endif
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
.kasan_depth = 1,
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf5f39a01017..0b4ed0f2809d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1273,11 +1273,43 @@ static unsigned long fraction_mm_sched(struct rq *rq,
return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
}
+static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
+{
+ int mm_sched_llc = -1;
+
+ if (!mm)
+ return -1;
+
+ if (mm->sc_stat.cpu != -1) {
+ mm_sched_llc = llc_id(mm->sc_stat.cpu);
+
+#ifdef CONFIG_NUMA_BALANCING
+ /*
+ * Don't assign preferred LLC if it
+ * conflicts with NUMA balancing.
+ * This can happen when sched_setnuma() gets
+ * called, however it is not much of an issue
+ * because we expect account_mm_sched() to get
+ * called fairly regularly -- at a higher rate
+ * than sched_setnuma() at least -- and thus the
+ * conflict only exists for a short period of time.
+ */
+ if (static_branch_likely(&sched_numa_balancing) &&
+ p->numa_preferred_nid >= 0 &&
+ cpu_to_node(mm->sc_stat.cpu) != p->numa_preferred_nid)
+ mm_sched_llc = -1;
+#endif
+ }
+
+ return mm_sched_llc;
+}
+
static inline
void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
{
struct sched_cache_time *pcpu_sched;
struct mm_struct *mm = p->mm;
+ int mm_sched_llc = -1;
unsigned long epoch;
if (!sched_cache_enabled())
@@ -1311,6 +1343,11 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
}
+
+ mm_sched_llc = get_pref_llc(p, mm);
+
+ if (p->preferred_llc != mm_sched_llc)
+ p->preferred_llc = mm_sched_llc;
}
static void task_tick_cache(struct rq *rq, struct task_struct *p)
@@ -1440,6 +1477,11 @@ void init_sched_mm(struct task_struct *p) { }
static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
+static inline int get_pref_llc(struct task_struct *p,
+ struct mm_struct *mm)
+{
+ return -1;
+}
#endif
/*
--
2.32.0
* [PATCH v3 06/21] sched/cache: Track LLC-preferred tasks per runqueue
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (4 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 05/21] sched/cache: Assign preferred LLC ID to processes Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-10 22:18 ` [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter Tim Chen
` (15 subsequent siblings)
21 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
For each runqueue, track the number of tasks with an LLC preference
and how many of them are running on their preferred LLC. This mirrors
nr_numa_running and nr_preferred_running for NUMA balancing, and will
be used by cache-aware load balancing in later patches.
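The accounting can be modeled with a minimal standalone sketch (not kernel
code: the struct names and fields here are illustrative stand-ins for the
rq counters and task state added by the patch):

```c
/* Minimal model of the per-runqueue LLC preference counters
 * (illustrative only; not the kernel's struct rq). */
struct rq_stats {
	unsigned int nr_llc_running;	  /* tasks with an LLC preference */
	unsigned int nr_pref_llc_running; /* of those, on their preferred LLC */
};

struct task_model {
	int preferred_llc;	/* -1 if no preference */
	int current_llc;	/* LLC of the CPU the task is queued on */
};

static void account_llc_enqueue(struct rq_stats *rq,
				const struct task_model *p)
{
	if (p->preferred_llc < 0)
		return;
	rq->nr_llc_running++;
	rq->nr_pref_llc_running += (p->preferred_llc == p->current_llc);
}

static void account_llc_dequeue(struct rq_stats *rq,
				const struct task_model *p)
{
	if (p->preferred_llc < 0)
		return;
	rq->nr_llc_running--;
	rq->nr_pref_llc_running -= (p->preferred_llc == p->current_llc);
}
```

Enqueueing a task that prefers the local LLC bumps both counters;
a task preferring a remote LLC bumps only nr_llc_running, and a task
with no preference touches neither.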
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Remove the sched_cache_enabled() check and make the
account_llc_{en,de}queue() depending on CONFIG_SCHED_CACHE,
so sched_llc_active in v2 can be removed.
(Peter Zijlstra)
kernel/sched/core.c | 5 +++++
kernel/sched/fair.c | 48 +++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 6 ++++++
3 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c6efa71cf500..c464e370576f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -532,6 +532,11 @@ void __trace_set_current_state(int state_value)
}
EXPORT_SYMBOL(__trace_set_current_state);
+int task_llc(const struct task_struct *p)
+{
+ return per_cpu(sd_llc_id, task_cpu(p));
+}
+
/*
* Serialization rules:
*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b4ed0f2809d..6ad9ad2f918f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1199,6 +1199,30 @@ static int llc_id(int cpu)
return per_cpu(sd_llc_id, cpu);
}
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
+{
+ int pref_llc;
+
+ pref_llc = p->preferred_llc;
+ if (pref_llc < 0)
+ return;
+
+ rq->nr_llc_running++;
+ rq->nr_pref_llc_running += (pref_llc == task_llc(p));
+}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
+{
+ int pref_llc;
+
+ pref_llc = p->preferred_llc;
+ if (pref_llc < 0)
+ return;
+
+ rq->nr_llc_running--;
+ rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
+}
+
void mm_init_sched(struct mm_struct *mm,
struct sched_cache_time __percpu *_pcpu_sched)
{
@@ -1304,6 +1328,8 @@ static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
return mm_sched_llc;
}
+static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
+
static inline
void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
{
@@ -1346,8 +1372,13 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
mm_sched_llc = get_pref_llc(p, mm);
- if (p->preferred_llc != mm_sched_llc)
+ /* task not on rq accounted later in account_entity_enqueue() */
+ if (task_running_on_cpu(rq->cpu, p) &&
+ p->preferred_llc != mm_sched_llc) {
+ account_llc_dequeue(rq, p);
p->preferred_llc = mm_sched_llc;
+ account_llc_enqueue(rq, p);
+ }
}
static void task_tick_cache(struct rq *rq, struct task_struct *p)
@@ -1482,6 +1513,11 @@ static inline int get_pref_llc(struct task_struct *p,
{
return -1;
}
+
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p) {}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p) {}
+
#endif
/*
@@ -3970,9 +4006,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_add(&cfs_rq->load, se->load.weight);
if (entity_is_task(se)) {
+ struct task_struct *p = task_of(se);
struct rq *rq = rq_of(cfs_rq);
- account_numa_enqueue(rq, task_of(se));
+ account_numa_enqueue(rq, p);
+ account_llc_enqueue(rq, p);
list_add(&se->group_node, &rq->cfs_tasks);
}
cfs_rq->nr_queued++;
@@ -3983,7 +4021,11 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
update_load_sub(&cfs_rq->load, se->load.weight);
if (entity_is_task(se)) {
- account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+ struct task_struct *p = task_of(se);
+ struct rq *rq = rq_of(cfs_rq);
+
+ account_numa_dequeue(rq, p);
+ account_llc_dequeue(rq, p);
list_del_init(&se->group_node);
}
cfs_rq->nr_queued--;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de5b701c3950..35cea6aa32a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,6 +1128,10 @@ struct rq {
unsigned int nr_preferred_running;
unsigned int numa_migrate_on;
#endif
+#ifdef CONFIG_SCHED_CACHE
+ unsigned int nr_pref_llc_running;
+ unsigned int nr_llc_running;
+#endif
#ifdef CONFIG_NO_HZ_COMMON
unsigned long last_blocked_load_update_tick;
unsigned int has_blocked_load;
@@ -1996,6 +2000,8 @@ init_numa_balancing(u64 clone_flags, struct task_struct *p)
#endif /* !CONFIG_NUMA_BALANCING */
+int task_llc(const struct task_struct *p);
+
static inline void
queue_balance_callback(struct rq *rq,
struct balance_callback *head,
--
2.32.0
* [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (5 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 06/21] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-20 10:45 ` Peter Zijlstra
2026-02-10 22:18 ` [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference Tim Chen
` (14 subsequent siblings)
21 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
Each CPU's lowest-level sched domain is assigned an array whose
elements, indexed from 0 to max_llcs - 1, track the number of
tasks preferring each LLC. Since every CPU has its own
lowest-level sd, each CPU effectively gets a dedicated set of
task LLC preference counters.
For example, sd->pf[3] = 2 signifies that there
are 2 tasks on this runqueue which prefer to run within LLC3.
The load balancer can use this information to identify busy
runqueues and migrate tasks to their preferred LLC domains.
This array will be reallocated at runtime during sched domain
rebuild.
This patch introduces the buffer allocation mechanism; the
statistics themselves are calculated in the subsequent patch.
Note: the LLC preference statistics of each CPU are reset on
sched domain rebuild and may temporarily undercount until the
CPU becomes idle and the count is cleared. This is a trade-off
that avoids complex data synchronization across sched domain
rebuilds.
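The array semantics, including tolerating the transient undercount after a
rebuild reset, can be sketched in standalone C (not kernel code: the array
size and helper names are illustrative):

```c
/* Toy model of the per-CPU preference array sd->pf (illustrative
 * only). pf[llc] counts queued tasks preferring that LLC; a sched
 * domain rebuild zeroes the array, so dequeue must tolerate an
 * already-zero counter instead of wrapping it around. */
#define MAX_LLCS 4

static void pf_enqueue(unsigned int pf[MAX_LLCS], int pref_llc)
{
	if (pref_llc >= 0 && pref_llc < MAX_LLCS)
		pf[pref_llc]++;
}

static void pf_dequeue(unsigned int pf[MAX_LLCS], int pref_llc)
{
	if (pref_llc < 0 || pref_llc >= MAX_LLCS)
		return;

	/* guard against transient undercount after a rebuild reset */
	if (pf[pref_llc])
		pf[pref_llc]--;
}
```

An extra dequeue after a simulated reset leaves the counter at zero
rather than wrapping to UINT_MAX, matching the "temporary undercount"
behavior described above.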
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Allocate preferred LLC buffer in rq->sd rather than
the rq. That way it automagically gets reallocated
and old buffer gets recycled during sched domain rebuild.
(Peter Zijlstra)
include/linux/sched/topology.h | 4 +++
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 64 +++++++++++++++++++++++++++++++++-
3 files changed, 69 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a4e2fb31f2fd..3aa6c101b2e4 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -102,6 +102,10 @@ struct sched_domain {
u64 max_newidle_lb_cost;
unsigned long last_decay_max_lb_cost;
+#ifdef CONFIG_SCHED_CACHE
+ unsigned int *pf;
+#endif
+
#ifdef CONFIG_SCHEDSTATS
/* sched_balance_rq() stats */
unsigned int lb_count[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35cea6aa32a4..ac8c7ac1ac0d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3903,6 +3903,8 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
#endif /* !CONFIG_SCHED_MM_CID */
#ifdef CONFIG_SCHED_CACHE
+extern int max_llcs;
+
static inline bool sched_cache_enabled(void)
{
return false;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca46b5cf7f78..dae78b5915a7 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -21,6 +21,7 @@ void sched_domains_mutex_unlock(void)
static cpumask_var_t sched_domains_tmpmask;
static cpumask_var_t sched_domains_tmpmask2;
static int tl_max_llcs;
+int max_llcs;
static int __init sched_debug_setup(char *str)
{
@@ -628,6 +629,11 @@ static void destroy_sched_domain(struct sched_domain *sd)
if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
kfree(sd->shared);
+
+#ifdef CONFIG_SCHED_CACHE
+ /* only the bottom sd has pref_llc array */
+ kfree(sd->pf);
+#endif
kfree(sd);
}
@@ -747,10 +753,15 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
if (sd && sd_degenerate(sd)) {
tmp = sd;
sd = sd->parent;
- destroy_sched_domain(tmp);
+
if (sd) {
struct sched_group *sg = sd->groups;
+#ifdef CONFIG_SCHED_CACHE
+ /* move pf to parent as child is being destroyed */
+ sd->pf = tmp->pf;
+ tmp->pf = NULL;
+#endif
/*
* sched groups hold the flags of the child sched
* domain for convenience. Clear such flags since
@@ -762,6 +773,8 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
sd->child = NULL;
}
+
+ destroy_sched_domain(tmp);
}
sched_domain_debug(sd, cpu);
@@ -787,6 +800,46 @@ enum s_alloc {
sa_none,
};
+#ifdef CONFIG_SCHED_CACHE
+static bool alloc_sd_pref(const struct cpumask *cpu_map,
+ struct s_data *d)
+{
+ struct sched_domain *sd;
+ unsigned int *pf;
+ int i;
+
+ for_each_cpu(i, cpu_map) {
+ sd = *per_cpu_ptr(d->sd, i);
+ if (!sd)
+ goto err;
+
+ pf = kcalloc(tl_max_llcs, sizeof(unsigned int), GFP_KERNEL);
+ if (!pf)
+ goto err;
+
+ sd->pf = pf;
+ }
+
+ return true;
+err:
+ for_each_cpu(i, cpu_map) {
+ sd = *per_cpu_ptr(d->sd, i);
+ if (sd) {
+ kfree(sd->pf);
+ sd->pf = NULL;
+ }
+ }
+
+ return false;
+}
+#else
+static bool alloc_sd_pref(const struct cpumask *cpu_map,
+ struct s_data *d)
+{
+ return false;
+}
+#endif
+
/*
* Return the canonical balance CPU for this group, this is the first CPU
* of this group that's also in the balance mask.
@@ -2710,6 +2763,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
}
+ alloc_sd_pref(cpu_map, &d);
+
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
@@ -2723,6 +2778,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
rcu_read_unlock();
+ /*
+ * Ensure we see enlarged sd->pf when we use new llc_ids and
+ * bigger max_llcs.
+ */
+ smp_mb();
+ max_llcs = tl_max_llcs;
+
if (has_asym)
static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
--
2.32.0
* [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (6 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-20 11:02 ` Peter Zijlstra
2026-02-10 22:18 ` [PATCH v3 09/21] sched/cache: Count tasks preferring destination LLC in a sched group Tim Chen
` (13 subsequent siblings)
21 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
Calculate per-runqueue statistics of tasks' LLC preferences.
These statistics are computed during task enqueue and dequeue
operations and are used by cache-aware load balancing.
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3: Move max_llcs check from patch4 to this patch.
This would clarify the rationale for the
max_llc check and makes review easier (Peter Zijlstra).
kernel/sched/fair.c | 56 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 54 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ad9ad2f918f..4a98aa866d65 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1199,28 +1199,80 @@ static int llc_id(int cpu)
return per_cpu(sd_llc_id, cpu);
}
+static inline bool valid_llc_id(int id)
+{
+ if (unlikely(id < 0 || id >= max_llcs))
+ return false;
+
+ return true;
+}
+
+static inline bool valid_llc_buf(struct sched_domain *sd,
+ int id)
+{
+ /*
+ * The check for sd and its corresponding pf buffer
+ * confirms that sd->pf[] has been allocated in
+ * build_sched_domains() after the assignment of
+ * per_cpu(sd_llc_id, i). This avoids racing with
+ * concurrent sched domain rebuilds.
+ */
+ if (unlikely(!sd || !sd->pf))
+ return false;
+
+ return valid_llc_id(id);
+}
+
static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
{
+ struct sched_domain *sd;
int pref_llc;
pref_llc = p->preferred_llc;
- if (pref_llc < 0)
+ if (!valid_llc_id(pref_llc))
return;
rq->nr_llc_running++;
rq->nr_pref_llc_running += (pref_llc == task_llc(p));
+
+ scoped_guard (rcu) {
+ sd = rcu_dereference(rq->sd);
+ if (valid_llc_buf(sd, pref_llc))
+ sd->pf[pref_llc]++;
+ }
}
static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
{
+ struct sched_domain *sd;
int pref_llc;
pref_llc = p->preferred_llc;
- if (pref_llc < 0)
+ if (!valid_llc_id(pref_llc))
return;
rq->nr_llc_running--;
rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
+
+ scoped_guard (rcu) {
+ sd = rcu_dereference(rq->sd);
+ if (valid_llc_buf(sd, pref_llc)) {
+ /*
+ * There is a race condition between dequeue
+ * and CPU hotplug. After a task has been enqueued
+ * on CPUx, a CPU hotplug event occurs, and all online
+ * CPUs (including CPUx) rebuild their sched_domains
+ * and reset statistics to zero (including sd->pf).
+ * This can cause a temporary undercount and we have to
+ * check for such underflow in sd->pf.
+ *
+ * This undercount is temporary and accurate accounting
+ * will resume once the rq has a chance to be idle.
+ */
+ if (sd->pf[pref_llc])
+ sd->pf[pref_llc]--;
+ }
+ }
}
void mm_init_sched(struct mm_struct *mm,
--
2.32.0
^ permalink raw reply related [flat|nested] 117+ messages in thread
* [PATCH v3 09/21] sched/cache: Count tasks preferring destination LLC in a sched group
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (7 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-20 12:52 ` Peter Zijlstra
2026-02-10 22:18 ` [PATCH v3 10/21] sched/cache: Check local_group only once in update_sg_lb_stats() Tim Chen
` (12 subsequent siblings)
21 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
During LLC load balancing, tabulate the number of tasks on each runqueue
in a sched group that prefer the LLC containing env->dst_cpu.
For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
selected as the busiest source to pick tasks from.
Within a source LLC, the total number of tasks preferring a destination
LLC is computed by summing counts across all CPUs in that LLC. For
instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
LLC3, the total for LLC0 is 3.
These statistics allow the load balancer to choose tasks from source
sched groups that best match their preferred LLCs.
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Rename nr_pref_llc to nr_pref_dst_llc for clarification.
kernel/sched/fair.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4a98aa866d65..bb93cc046d73 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10566,6 +10566,9 @@ struct sg_lb_stats {
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
#endif
+#ifdef CONFIG_SCHED_CACHE
+ unsigned int nr_pref_dst_llc;
+#endif
};
/*
@@ -11034,6 +11037,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
{
int i, nr_running, local_group, sd_flags = env->sd->flags;
bool balancing_at_rd = !env->sd->parent;
+#ifdef CONFIG_SCHED_CACHE
+ int dst_llc = llc_id(env->dst_cpu);
+#endif
memset(sgs, 0, sizeof(*sgs));
@@ -11054,6 +11060,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (cpu_overutilized(i))
*sg_overutilized = 1;
+#ifdef CONFIG_SCHED_CACHE
+ if (sched_cache_enabled() && llc_id(i) != dst_llc) {
+ struct sched_domain *sd_tmp = rcu_dereference(rq->sd);
+
+ if (valid_llc_buf(sd_tmp, dst_llc))
+ sgs->nr_pref_dst_llc += sd_tmp->pf[dst_llc];
+ }
+#endif
+
/*
* No need to call idle_cpu() if nr_running is not 0
*/
--
2.32.0
* [PATCH v3 10/21] sched/cache: Check local_group only once in update_sg_lb_stats()
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (8 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 09/21] sched/cache: Count tasks preferring destination LLC in a sched group Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-10 22:18 ` [PATCH v3 11/21] sched/cache: Prioritize tasks preferring destination LLC during balancing Tim Chen
` (11 subsequent siblings)
21 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
There is no need to check for the local group twice, once each for
group_asym_packing and group_smt_balance. Adjust the code to
facilitate future checks for group types (cache-aware
load balancing) as well.
No functional changes are expected.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
No change.
kernel/sched/fair.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb93cc046d73..b0cf4424d198 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11109,14 +11109,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_weight = group->group_weight;
- /* Check if dst CPU is idle and preferred to this group */
- if (!local_group && env->idle && sgs->sum_h_nr_running &&
- sched_group_asym(env, sgs, group))
- sgs->group_asym_packing = 1;
-
- /* Check for loaded SMT group to be balanced to dst CPU */
- if (!local_group && smt_balance(env, sgs, group))
- sgs->group_smt_balance = 1;
+ if (!local_group) {
+ /* Check if dst CPU is idle and preferred to this group */
+ if (env->idle && sgs->sum_h_nr_running &&
+ sched_group_asym(env, sgs, group))
+ sgs->group_asym_packing = 1;
+
+ /* Check for loaded SMT group to be balanced to dst CPU */
+ if (smt_balance(env, sgs, group))
+ sgs->group_smt_balance = 1;
+ }
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
--
2.32.0
* [PATCH v3 11/21] sched/cache: Prioritize tasks preferring destination LLC during balancing
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (9 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 10/21] sched/cache: Check local_group only once in update_sg_lb_stats() Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-17 18:33 ` Madadi Vineeth Reddy
2026-02-10 22:18 ` [PATCH v3 12/21] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
` (10 subsequent siblings)
21 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
During LLC load balancing, first check for tasks that prefer the
destination LLC and balance them to it before others.
Mark source sched groups containing tasks that prefer non-local LLCs
with the group_llc_balance flag. This ensures the load balancer later
pulls or pushes these tasks toward their preferred LLCs.
The load balancer selects the busiest sched_group and migrates tasks
to less busy groups to distribute load across CPUs.
With cache-aware scheduling enabled, the busiest sched_group is
the one with the most tasks preferring the destination LLC. If
the group has the llc_balance flag set, cache-aware load balancing is
triggered.
Introduce the helper function update_llc_busiest() to identify the
sched_group with the most tasks preferring the destination LLC.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Consider sd->nr_balance_failed when deciding whether
LLC load balance should be used.
(Peter Zijlstra)
kernel/sched/fair.c | 77 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 76 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0cf4424d198..43dcf2827298 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9649,6 +9649,11 @@ enum group_type {
* from balancing the load across the system.
*/
group_imbalanced,
+ /*
+ * There are tasks running on a non-preferred LLC; it may be possible
+ * to move them to their preferred LLC without creating too much imbalance.
+ */
+ group_llc_balance,
/*
* The CPU is overloaded and can't provide expected CPU cycles to all
* tasks.
@@ -10561,6 +10566,7 @@ struct sg_lb_stats {
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned int group_smt_balance; /* Task on busy SMT be moved */
+ unsigned int group_llc_balance; /* Tasks should be moved to preferred LLC */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -10819,6 +10825,9 @@ group_type group_classify(unsigned int imbalance_pct,
if (group_is_overloaded(imbalance_pct, sgs))
return group_overloaded;
+ if (sgs->group_llc_balance)
+ return group_llc_balance;
+
if (sg_imbalanced(group))
return group_imbalanced;
@@ -11012,11 +11021,66 @@ static void record_sg_llc_stats(struct lb_env *env,
if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
}
+
+/*
+ * Do LLC balancing on a sched group that spans an LLC and has tasks
+ * preferring to run on the idle dst_cpu's LLC.
+ */
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+ struct sched_group *group)
+{
+ if (!sched_cache_enabled())
+ return false;
+
+ if (env->sd->flags & SD_SHARE_LLC)
+ return false;
+
+ /*
+ * Don't do cache aware balancing if there
+ * are too many balance failures.
+ *
+ * Should fall back to regular load balancing
+ * after repeated cache aware balance failures.
+ */
+ if (env->sd->nr_balance_failed >=
+ env->sd->cache_nice_tries + 1)
+ return false;
+
+ if (sgs->nr_pref_dst_llc &&
+ can_migrate_llc(cpumask_first(sched_group_span(group)),
+ env->dst_cpu, 0, true) == mig_llc)
+ return true;
+
+ return false;
+}
+
+static bool update_llc_busiest(struct lb_env *env,
+ struct sg_lb_stats *busiest,
+ struct sg_lb_stats *sgs)
+{
+ /*
+ * There are more tasks that want to run on dst_cpu's LLC.
+ */
+ return sgs->nr_pref_dst_llc > busiest->nr_pref_dst_llc;
+}
#else
static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
struct sched_group *group)
{
}
+
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+ struct sched_group *group)
+{
+ return false;
+}
+
+static bool update_llc_busiest(struct lb_env *env,
+ struct sg_lb_stats *busiest,
+ struct sg_lb_stats *sgs)
+{
+ return false;
+}
#endif
/**
@@ -11118,6 +11182,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Check for loaded SMT group to be balanced to dst CPU */
if (smt_balance(env, sgs, group))
sgs->group_smt_balance = 1;
+
+ /* Check for tasks in this group can be moved to their preferred LLC */
+ if (llc_balance(env, sgs, group))
+ sgs->group_llc_balance = 1;
}
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
@@ -11181,6 +11249,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
/* Select the overloaded group with highest avg_load. */
return sgs->avg_load > busiest->avg_load;
+ case group_llc_balance:
+ /* Select the group with most tasks preferring dst LLC */
+ return update_llc_busiest(env, busiest, sgs);
+
case group_imbalanced:
/*
* Select the 1st imbalanced group as we don't have any way to
@@ -11443,6 +11515,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
return false;
break;
+ case group_llc_balance:
case group_imbalanced:
case group_asym_packing:
case group_smt_balance:
@@ -11575,6 +11648,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
return NULL;
break;
+ case group_llc_balance:
case group_imbalanced:
case group_asym_packing:
case group_smt_balance:
@@ -12074,7 +12148,8 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
* group's child domain.
*/
if (sds.prefer_sibling && local->group_type == group_has_spare &&
- sibling_imbalance(env, &sds, busiest, local) > 1)
+ (busiest->group_type == group_llc_balance ||
+ sibling_imbalance(env, &sds, busiest, local) > 1))
goto force_balance;
if (busiest->group_type != group_overloaded) {
--
2.32.0
* [PATCH v3 12/21] sched/cache: Add migrate_llc_task migration type for cache-aware balancing
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (10 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 11/21] sched/cache: Prioritize tasks preferring destination LLC during balancing Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-10 22:18 ` [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
` (9 subsequent siblings)
21 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
Introduce a new migration type, migrate_llc_task, to support
cache-aware load balancing.
After identifying the busiest sched_group (having the most tasks
preferring the destination LLC), mark migrations with this type.
During load balancing, each runqueue in the busiest sched_group is
examined, and the runqueue with the highest number of tasks preferring
the destination CPU's LLC is selected as the busiest runqueue.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Let the enum and switch statements have the same order.
(Peter Zijlstra)
kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43dcf2827298..1697791ef11c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9665,7 +9665,8 @@ enum migration_type {
migrate_load = 0,
migrate_util,
migrate_task,
- migrate_misfit
+ migrate_misfit,
+ migrate_llc_task
};
#define LBF_ALL_PINNED 0x01
@@ -10266,6 +10267,10 @@ static int detach_tasks(struct lb_env *env)
env->imbalance = 0;
break;
+
+ case migrate_llc_task:
+ env->imbalance--;
+ break;
}
detach_task(p, env);
@@ -11902,6 +11907,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return;
}
+#ifdef CONFIG_SCHED_CACHE
+ if (busiest->group_type == group_llc_balance) {
+ /* Move a task that prefers the local LLC */
+ env->migration_type = migrate_llc_task;
+ env->imbalance = 1;
+ return;
+ }
+#endif
+
if (busiest->group_type == group_imbalanced) {
/*
* In the group_imb case we cannot rely on group-wide averages
@@ -12209,6 +12223,11 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
struct rq *busiest = NULL, *rq;
unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
unsigned int busiest_nr = 0;
+#ifdef CONFIG_SCHED_CACHE
+ unsigned int busiest_pref_llc = 0;
+ struct sched_domain *sd_tmp;
+ int dst_llc;
+#endif
int i;
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
@@ -12336,6 +12355,21 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
break;
+ case migrate_llc_task:
+#ifdef CONFIG_SCHED_CACHE
+ sd_tmp = rcu_dereference(rq->sd);
+ dst_llc = llc_id(env->dst_cpu);
+ if (valid_llc_buf(sd_tmp, dst_llc)) {
+ unsigned int this_pref_llc = sd_tmp->pf[dst_llc];
+
+ if (busiest_pref_llc < this_pref_llc) {
+ busiest_pref_llc = this_pref_llc;
+ busiest = rq;
+ }
+ }
+#endif
+ break;
+
}
}
@@ -12499,6 +12533,8 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
case migrate_misfit:
__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
break;
+ case migrate_llc_task:
+ break;
}
}
--
2.32.0
* [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (11 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 12/21] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-17 19:00 ` Madadi Vineeth Reddy
2026-02-20 13:53 ` Peter Zijlstra
2026-02-10 22:18 ` [PATCH v3 14/21] sched/cache: Respect LLC preference in task migration and detach Tim Chen
` (8 subsequent siblings)
21 siblings, 2 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
In generic (non-cache-aware) load balancing, if the busiest
runqueue has only one task, active balancing may be invoked to
move it. However, this migration might break LLC locality.
Before migration, check whether the task is running on its preferred
LLC: do not move a lone task to another LLC if doing so would move it
away from its preferred LLC or cause excessive imbalance between LLCs.
On the other hand, if the migration type is migrate_llc_task, it means
that there are tasks on env->src_cpu that want to be migrated to
their preferred LLC, so launch active load balancing anyway.
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Remove redundant rcu read lock in break_llc_locality().
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1697791ef11c..03959a701514 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9999,12 +9999,60 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
task_util(p), to_pref);
}
+/*
+ * Check if active load balance breaks LLC locality in
+ * terms of cache aware load balance.
+ */
+static inline bool
+alb_break_llc(struct lb_env *env)
+{
+ if (!sched_cache_enabled())
+ return false;
+
+ if (cpus_share_cache(env->src_cpu, env->dst_cpu))
+ return false;
+ /*
+ * All tasks prefer to stay on their current LLC.
+ * Do not pull a task from its preferred LLC if:
+ * 1. It is the only task running there; OR
+ * 2. Migrating it away from its preferred LLC would violate
+ * the cache-aware scheduling policy.
+ */
+ if (env->src_rq->nr_pref_llc_running &&
+ env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
+ unsigned long util = 0;
+ struct task_struct *cur;
+
+ if (env->src_rq->nr_running <= 1)
+ return true;
+
+ /*
+ * Reach here in load balance with
+ * rcu_read_lock() protected.
+ */
+ cur = rcu_dereference(env->src_rq->curr);
+ if (cur)
+ util = task_util(cur);
+
+ if (can_migrate_llc(env->src_cpu, env->dst_cpu,
+ util, false) == mig_forbid)
+ return true;
+ }
+
+ return false;
+}
#else
static inline bool get_llc_stats(int cpu, unsigned long *util,
unsigned long *cap)
{
return false;
}
+
+static inline bool
+alb_break_llc(struct lb_env *env)
+{
+ return false;
+}
#endif
/*
* can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -12421,6 +12469,9 @@ static int need_active_balance(struct lb_env *env)
{
struct sched_domain *sd = env->sd;
+ if (alb_break_llc(env))
+ return 0;
+
if (asym_active_balance(env))
return 1;
@@ -12440,7 +12491,8 @@ static int need_active_balance(struct lb_env *env)
return 1;
}
- if (env->migration_type == migrate_misfit)
+ if (env->migration_type == migrate_misfit ||
+ env->migration_type == migrate_llc_task)
return 1;
return 0;
--
2.32.0
* [PATCH v3 14/21] sched/cache: Respect LLC preference in task migration and detach
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (12 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-18 9:14 ` Madadi Vineeth Reddy
2026-02-10 22:18 ` [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
` (7 subsequent siblings)
21 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
During the final step of load balancing, can_migrate_task() now
considers a task's LLC preference before moving it out of its
preferred LLC.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Use a similar mechanism to NUMA balancing, which skips over
tasks that would degrade locality in can_migrate_task();
only if nr_balance_failed is high enough do we ignore that.
(Peter Zijlstra)
Let migrate_degrade_locality() take precedence over
migrate_degrades_llc(), which aims to migrate towards the preferred
NUMA node. (Peter Zijlstra)
kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 13 +++++++++
2 files changed, 73 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03959a701514..d1145997b88d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9973,8 +9973,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
* Check if task p can migrate from source LLC to
* destination LLC in terms of cache aware load balance.
*/
-static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
- struct task_struct *p)
+static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+ struct task_struct *p)
{
struct mm_struct *mm;
bool to_pref;
@@ -10041,6 +10041,47 @@ alb_break_llc(struct lb_env *env)
return false;
}
+
+/*
+ * Check if migrating task p from env->src_cpu to
+ * env->dst_cpu breaks LLC locality.
+ */
+static bool migrate_degrades_llc(struct task_struct *p, struct lb_env *env)
+{
+ if (!sched_cache_enabled())
+ return false;
+
+ if (task_has_sched_core(p))
+ return false;
+ /*
+ * Skip over tasks that would degrade LLC locality;
+ * only when nr_balance_failed is sufficiently high do we
+ * ignore this constraint.
+ *
+ * The nr_balance_failed threshold is set to
+ * cache_nice_tries + 1 to avoid excessive task
+ * migration. Refer to comments around
+ * llc_balance().
+ */
+ if (env->sd->nr_balance_failed >= env->sd->cache_nice_tries + 1)
+ return false;
+
+ /*
+ * We know env->src_cpu has some tasks that prefer to
+ * run on env->dst_cpu; skip the tasks that do not prefer
+ * env->dst_cpu and find one that does.
+ */
+ if (env->migration_type == migrate_llc_task &&
+ task_llc(p) != llc_id(env->dst_cpu))
+ return true;
+
+ if (can_migrate_llc_task(env->src_cpu,
+ env->dst_cpu, p) != mig_forbid)
+ return false;
+
+ return true;
+}
+
#else
static inline bool get_llc_stats(int cpu, unsigned long *util,
unsigned long *cap)
@@ -10053,6 +10094,12 @@ alb_break_llc(struct lb_env *env)
{
return false;
}
+
+static inline bool
+migrate_degrades_llc(struct task_struct *p, struct lb_env *env)
+{
+ return false;
+}
#endif
/*
* can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -10150,10 +10197,19 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 1;
degrades = migrate_degrades_locality(p, env);
- if (!degrades)
+ if (!degrades) {
+ /*
+ * If the NUMA locality is not broken,
+ * further check if migration would hurt
+ * LLC locality.
+ */
+ if (migrate_degrades_llc(p, env))
+ return 0;
+
hot = task_hot(p, env);
- else
+ } else {
hot = degrades > 0;
+ }
if (!hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
if (hot)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac8c7ac1ac0d..c18e59f320a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1495,6 +1495,14 @@ extern void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags);
extern void sched_core_get(void);
extern void sched_core_put(void);
+static inline bool task_has_sched_core(struct task_struct *p)
+{
+ if (sched_core_disabled())
+ return false;
+
+ return !!p->core_cookie;
+}
+
#else /* !CONFIG_SCHED_CORE: */
static inline bool sched_core_enabled(struct rq *rq)
@@ -1534,6 +1542,11 @@ static inline bool sched_group_cookie_match(struct rq *rq,
return true;
}
+static inline bool task_has_sched_core(struct task_struct *p)
+{
+ return false;
+}
+
#endif /* !CONFIG_SCHED_CORE */
#ifdef CONFIG_RT_GROUP_SCHED
--
2.32.0
* [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (13 preceding siblings ...)
2026-02-10 22:18 ` [PATCH v3 14/21] sched/cache: Respect LLC preference in task migration and detach Tim Chen
@ 2026-02-10 22:18 ` Tim Chen
2026-02-18 17:54 ` Madadi Vineeth Reddy
2026-02-19 16:50 ` Peter Zijlstra
2026-02-10 22:18 ` [PATCH v3 16/21] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
` (6 subsequent siblings)
21 siblings, 2 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.
With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread count.
If the number of active threads within the process exceeds the number
of cores in the LLC (the number of CPUs divided by the number of SMT
siblings), do not enable cache-aware scheduling. For users who wish
to perform task aggregation regardless, a debugfs knob is provided
for tuning in a subsequent patch.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
Notes:
v2->v3:
Put the calculating of nr_running_avg and the use of it into 1 patch.
(Peter Zijlstra)
Use guard(rcu)() when calculating the number of active threads of the
process.
(Peter Zijlstra)
Introduce update_avg_scale() rather than using update_avg() to fit
system with small LLC.
(Aaron Lu)
include/linux/sched.h | 1 +
kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++++++++++---
2 files changed, 57 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c98bd1c46088..511c9b263386 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2346,6 +2346,7 @@ struct sched_cache_stat {
struct sched_cache_time __percpu *pcpu_sched;
raw_spinlock_t lock;
unsigned long epoch;
+ u64 nr_running_avg;
int cpu;
} ____cacheline_aligned_in_smp;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d1145997b88d..86b6b08e7e1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
return valid_llc_id(id);
}
+static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+{
+ int smt_nr = 1;
+
+#ifdef CONFIG_SCHED_SMT
+ if (sched_smt_active())
+ smt_nr = cpumask_weight(cpu_smt_mask(cpu));
+#endif
+
+ return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
+ per_cpu(sd_llc_size, cpu));
+}
+
static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
{
struct sched_domain *sd;
@@ -1417,7 +1430,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
*/
if (time_after(epoch,
READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
- get_nr_threads(p) <= 1) {
+ get_nr_threads(p) <= 1 ||
+ exceed_llc_nr(mm, cpu_of(rq))) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
}
@@ -1458,13 +1472,31 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
}
}
+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+ int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
+ s64 diff = sample - *avg;
+ u32 divisor;
+
+ /*
+ * Scale the divisor based on the number of CPUs contained
+ * in the LLC. This scaling ensures smaller LLC domains use
+ * a smaller divisor to achieve more precise sensitivity to
+ * changes in nr_running, while larger LLC domains are capped
+ * at a maximum divisor of 8 which is the default smoothing
+ * factor of EWMA in update_avg().
+ */
+ divisor = clamp_t(u32, (factor >> 2), 2, 8);
+ *avg += div64_s64(diff, divisor);
+}
+
static void task_cache_work(struct callback_head *work)
{
- struct task_struct *p = current;
+ struct task_struct *p = current, *cur;
struct mm_struct *mm = p->mm;
unsigned long m_a_occ = 0;
unsigned long curr_m_a_occ = 0;
- int cpu, m_a_cpu = -1;
+ int cpu, m_a_cpu = -1, nr_running = 0;
cpumask_var_t cpus;
WARN_ON_ONCE(work != &p->cache_work);
@@ -1474,6 +1506,13 @@ static void task_cache_work(struct callback_head *work)
if (p->flags & PF_EXITING)
return;
+ if (get_nr_threads(p) <= 1) {
+ if (mm->sc_stat.cpu != -1)
+ mm->sc_stat.cpu = -1;
+
+ return;
+ }
+
if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
return;
@@ -1497,6 +1536,12 @@ static void task_cache_work(struct callback_head *work)
m_occ = occ;
m_cpu = i;
}
+ scoped_guard (rcu) {
+ cur = rcu_dereference(cpu_rq(i)->curr);
+ if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+ cur->mm == mm)
+ nr_running++;
+ }
}
/*
@@ -1540,6 +1585,7 @@ static void task_cache_work(struct callback_head *work)
mm->sc_stat.cpu = m_a_cpu;
}
+ update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
free_cpumask_var(cpus);
}
@@ -9988,6 +10034,13 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
return mig_unrestricted;
+ /* skip cache aware load balance for single/too many threads */
+ if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu)) {
+ if (mm->sc_stat.cpu != -1)
+ mm->sc_stat.cpu = -1;
+ return mig_unrestricted;
+ }
+
if (cpus_share_cache(dst_cpu, cpu))
to_pref = true;
else if (cpus_share_cache(src_cpu, cpu))
--
2.32.0
* [PATCH v3 16/21] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.
To mitigate this, estimate a process's memory footprint by comparing
its RSS (anonymous and shared pages) to the size of the LLC. If RSS
exceeds the LLC size, skip cache-aware scheduling.
Note that RSS is only an approximation of the memory footprint.
By default, the comparison is strict, but a later patch will allow
users to provide a hint to adjust this threshold.
According to testing from Adam, some systems do not have a shared L3
but have shared L2 clusters; in that case, the L2 becomes the LLC[1].
Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com/
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
Notes:
v2->v3:
Fix overflow issue in exceed_llc_capacity() by changing
the type of llc from int to u64.
(Jianyong Wu, Yangyu Chen)
include/linux/cacheinfo.h | 21 ++++++++++-------
kernel/sched/fair.c | 48 +++++++++++++++++++++++++++++++++++----
2 files changed, 56 insertions(+), 13 deletions(-)
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index c8f4f0a0b874..82d0d59ca0e1 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
-/*
- * Get the cacheinfo structure for the cache associated with @cpu at
- * level @level.
- * cpuhp lock must be held.
- */
-static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
{
struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
int i;
- lockdep_assert_cpus_held();
-
for (i = 0; i < ci->num_leaves; i++) {
if (ci->info_list[i].level == level) {
if (ci->info_list[i].attributes & CACHE_ID)
@@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
return NULL;
}
+/*
+ * Get the cacheinfo structure for the cache associated with @cpu at
+ * level @level.
+ * cpuhp lock must be held.
+ */
+static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+{
+ lockdep_assert_cpus_held();
+
+ return _get_cpu_cacheinfo_level(cpu, level);
+}
+
/*
* Get the id of the cache associated with @cpu at level @level.
* cpuhp lock must be held.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86b6b08e7e1e..ee4982af2bdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,37 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
return valid_llc_id(id);
}
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+ struct cacheinfo *ci;
+ u64 rss, llc;
+
+ /*
+ * get_cpu_cacheinfo_level() can not be used
+ * because it requires the cpu_hotplug_lock
+ * to be held. Use _get_cpu_cacheinfo_level()
+ * directly because the 'cpu' can not be
+ * offlined at the moment.
+ */
+ ci = _get_cpu_cacheinfo_level(cpu, 3);
+ if (!ci) {
+ /*
+ * On system without L3 but with shared L2,
+ * L2 becomes the LLC.
+ */
+ ci = _get_cpu_cacheinfo_level(cpu, 2);
+ if (!ci)
+ return true;
+ }
+
+ llc = ci->size;
+
+ rss = get_mm_counter(mm, MM_ANONPAGES) +
+ get_mm_counter(mm, MM_SHMEMPAGES);
+
+ return (llc <= (rss * PAGE_SIZE));
+}
+
static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
{
int smt_nr = 1;
@@ -1431,7 +1462,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
if (time_after(epoch,
READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
get_nr_threads(p) <= 1 ||
- exceed_llc_nr(mm, cpu_of(rq))) {
+ exceed_llc_nr(mm, cpu_of(rq)) ||
+ exceed_llc_capacity(mm, cpu_of(rq))) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
}
@@ -1496,7 +1528,7 @@ static void task_cache_work(struct callback_head *work)
struct mm_struct *mm = p->mm;
unsigned long m_a_occ = 0;
unsigned long curr_m_a_occ = 0;
- int cpu, m_a_cpu = -1, nr_running = 0;
+ int cpu, m_a_cpu = -1, nr_running = 0, curr_cpu;
cpumask_var_t cpus;
WARN_ON_ONCE(work != &p->cache_work);
@@ -1506,7 +1538,9 @@ static void task_cache_work(struct callback_head *work)
if (p->flags & PF_EXITING)
return;
- if (get_nr_threads(p) <= 1) {
+ curr_cpu = task_cpu(p);
+ if (get_nr_threads(p) <= 1 ||
+ exceed_llc_capacity(mm, curr_cpu)) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
@@ -10034,8 +10068,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
return mig_unrestricted;
- /* skip cache aware load balance for single/too many threads */
- if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu)) {
+ /*
+ * Skip cache aware load balance for single/too many threads
+ * or large memory RSS.
+ */
+ if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
+ exceed_llc_capacity(mm, dst_cpu)) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
return mig_unrestricted;
--
2.32.0
* [PATCH v3 17/21] sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Cache-aware load balancing should only be enabled if there is more
than one LLC within a NUMA node. Introduce sched_cache_present to
indicate whether the platform has such a topology.
Suggested-by: Libo Chen <libchen@purestorage.com>
Suggested-by: Adam Li <adamli@os.amperecomputing.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
Notes:
v2->v3:
No change.
kernel/sched/sched.h | 3 ++-
kernel/sched/topology.c | 18 ++++++++++++++++--
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c18e59f320a6..59ac04625842 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3916,11 +3916,12 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
#endif /* !CONFIG_SCHED_MM_CID */
#ifdef CONFIG_SCHED_CACHE
+DECLARE_STATIC_KEY_FALSE(sched_cache_present);
extern int max_llcs;
static inline bool sched_cache_enabled(void)
{
- return false;
+ return static_branch_unlikely(&sched_cache_present);
}
#endif
extern void init_sched_mm(struct task_struct *p);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index dae78b5915a7..9104fed25351 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -801,6 +801,7 @@ enum s_alloc {
};
#ifdef CONFIG_SCHED_CACHE
+DEFINE_STATIC_KEY_FALSE(sched_cache_present);
static bool alloc_sd_pref(const struct cpumask *cpu_map,
struct s_data *d)
{
@@ -2604,6 +2605,7 @@ static int
build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
{
enum s_alloc alloc_state = sa_none;
+ bool has_multi_llcs = false;
struct sched_domain *sd;
struct s_data d;
struct rq *rq = NULL;
@@ -2731,10 +2733,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
* between LLCs and memory channels.
*/
nr_llcs = sd->span_weight / child->span_weight;
- if (nr_llcs == 1)
+ if (nr_llcs == 1) {
imb = sd->span_weight >> 3;
- else
+ } else {
imb = nr_llcs;
+ has_multi_llcs = true;
+ }
imb = max(1U, imb);
sd->imb_numa_nr = imb;
@@ -2796,6 +2800,16 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
ret = 0;
error:
+#ifdef CONFIG_SCHED_CACHE
+ /*
+ * TBD: check before writing to it. sched domain rebuild
+ * is not in the critical path, leave as-is for now.
+ */
+ if (!ret && has_multi_llcs)
+ static_branch_enable_cpuslocked(&sched_cache_present);
+ else
+ static_branch_disable_cpuslocked(&sched_cache_present);
+#endif
__free_domain_allocs(&d, alloc_state, cpu_map);
return ret;
--
2.32.0
* [PATCH v3 18/21] sched/cache: Allow the user space to turn on and off cache aware scheduling
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Provide a debugfs knob that allows the user to turn cache-aware
scheduling on and off at runtime.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Split into a new patch for better review, use kstrtobool_from_user()
to get the user input. (Peter Zijlstra)
kernel/sched/debug.c | 45 ++++++++++++++++++++++++++++
kernel/sched/sched.h | 7 +++--
kernel/sched/topology.c | 65 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 115 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..bae747eddc59 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -215,6 +215,46 @@ static const struct file_operations sched_scaling_fops = {
.release = single_release,
};
+#ifdef CONFIG_SCHED_CACHE
+static ssize_t
+sched_cache_enable_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ bool val;
+ int ret;
+
+ ret = kstrtobool_from_user(ubuf, cnt, &val);
+ if (ret)
+ return ret;
+
+ sysctl_sched_cache_user = val;
+
+ sched_cache_active_set_unlocked();
+
+ return cnt;
+}
+
+static int sched_cache_enable_show(struct seq_file *m, void *v)
+{
+ seq_printf(m, "%d\n", sysctl_sched_cache_user);
+ return 0;
+}
+
+static int sched_cache_enable_open(struct inode *inode,
+ struct file *filp)
+{
+ return single_open(filp, sched_cache_enable_show, NULL);
+}
+
+static const struct file_operations sched_cache_enable_fops = {
+ .open = sched_cache_enable_open,
+ .write = sched_cache_enable_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+#endif
+
#ifdef CONFIG_PREEMPT_DYNAMIC
static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
@@ -523,6 +563,11 @@ static __init int sched_init_debug(void)
debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_SCHED_CACHE
+ debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
+ &sched_cache_enable_fops);
+#endif
+
debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
debugfs_fair_server_init();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 59ac04625842..adf3428745dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3917,12 +3917,15 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
#ifdef CONFIG_SCHED_CACHE
DECLARE_STATIC_KEY_FALSE(sched_cache_present);
-extern int max_llcs;
+DECLARE_STATIC_KEY_FALSE(sched_cache_active);
+extern int max_llcs, sysctl_sched_cache_user;
static inline bool sched_cache_enabled(void)
{
- return static_branch_unlikely(&sched_cache_present);
+ return static_branch_unlikely(&sched_cache_active);
}
+
+extern void sched_cache_active_set_unlocked(void);
#endif
extern void init_sched_mm(struct task_struct *p);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9104fed25351..e86dea1b9e86 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -801,7 +801,16 @@ enum s_alloc {
};
#ifdef CONFIG_SCHED_CACHE
+/* hardware support for cache aware scheduling */
DEFINE_STATIC_KEY_FALSE(sched_cache_present);
+/*
+ * Indicator of whether cache aware scheduling
+ * is active, used by the scheduler.
+ */
+DEFINE_STATIC_KEY_FALSE(sched_cache_active);
+/* user wants cache aware scheduling [0 or 1] */
+int sysctl_sched_cache_user = 1;
+
static bool alloc_sd_pref(const struct cpumask *cpu_map,
struct s_data *d)
{
@@ -833,6 +842,60 @@ static bool alloc_sd_pref(const struct cpumask *cpu_map,
return false;
}
+
+static void _sched_cache_active_set(bool enable, bool locked)
+{
+ if (enable) {
+ if (locked)
+ static_branch_enable_cpuslocked(&sched_cache_active);
+ else
+ static_branch_enable(&sched_cache_active);
+ } else {
+ if (locked)
+ static_branch_disable_cpuslocked(&sched_cache_active);
+ else
+ static_branch_disable(&sched_cache_active);
+ }
+}
+
+/*
+ * Enable/disable cache aware scheduling according to
+ * user input and the presence of hardware support.
+ */
+static void sched_cache_active_set(bool locked)
+{
+ /* hardware does not support */
+ if (!static_branch_likely(&sched_cache_present)) {
+ _sched_cache_active_set(false, locked);
+ return;
+ }
+
+ /*
+ * user wants it or not ?
+ * TBD: read before writing the static key.
+ * It is not in the critical path, leave as-is
+ * for now.
+ */
+ if (sysctl_sched_cache_user) {
+ _sched_cache_active_set(true, locked);
+ if (sched_debug())
+ pr_info("%s: enabling cache aware scheduling\n", __func__);
+ } else {
+ _sched_cache_active_set(false, locked);
+ if (sched_debug())
+ pr_info("%s: disabling cache aware scheduling\n", __func__);
+ }
+}
+
+static void sched_cache_active_set_locked(void)
+{
+ return sched_cache_active_set(true);
+}
+
+void sched_cache_active_set_unlocked(void)
+{
+ return sched_cache_active_set(false);
+}
#else
static bool alloc_sd_pref(const struct cpumask *cpu_map,
struct s_data *d)
@@ -2809,6 +2872,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
static_branch_enable_cpuslocked(&sched_cache_present);
else
static_branch_disable_cpuslocked(&sched_cache_present);
+
+ sched_cache_active_set_locked();
#endif
__free_domain_allocs(&d, alloc_state, cpu_map);
--
2.32.0
* [PATCH v3 19/21] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
From: Tim Chen @ 2026-02-10 22:18 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Introduce a set of debugfs knobs to control how aggressively
cache-aware scheduling aggregates tasks.
(1) llc_aggr_tolerance
With sched_cache enabled, the scheduler uses a process's RSS as a
proxy for its LLC footprint to determine if aggregating tasks on the
preferred LLC could cause cache contention. If RSS exceeds the LLC
size, aggregation is skipped. Some workloads with large RSS but small
actual memory footprints may still benefit from aggregation. Since
the kernel cannot efficiently track per-task cache usage (resctrl is
user-space only), userspace can provide a more accurate hint.
Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
users control how strictly RSS limits aggregation. Values range from
0 to 100:
- 0: Cache-aware scheduling is disabled.
- 1: Strict; tasks with RSS larger than LLC size are skipped.
- >=100: Aggressive; tasks are aggregated regardless of RSS.
For example, with a 32MB L3 cache:
- llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
- llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
(784GB = (1 + (99 - 1) * 256) * 32MB).
Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance controls how
strictly the number of active threads is considered during cache-aware
load balancing. The SMT count is also factored in: high SMT counts
reduce the aggregation capacity, preventing excessive task aggregation
on SMT-heavy systems like Power10/Power11.
Yangyu suggested introducing separate aggregation controls for the
number of active threads and memory RSS checks. Since there are plans
to add per-process/task group controls, fine-grained tunables are
deferred to that implementation.
(2) llc_epoch_period, llc_epoch_affinity_timeout,
llc_imb_pct and llc_overaggr_pct are also turned into tunables.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: Jianyong Wu <jianyong.wu@outlook.com>
Suggested-by: Yangyu Chen <cyy@cyyself.name>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
Notes:
v2->v3:
Simplify the implementation by using debugfs_create_u32() for all
tunable parameters.
kernel/sched/debug.c | 10 ++++++++
kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 5 ++++
3 files changed, 67 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bae747eddc59..dc4b7de6569f 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -566,6 +566,16 @@ static __init int sched_init_debug(void)
#ifdef CONFIG_SCHED_CACHE
debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
&sched_cache_enable_fops);
+ debugfs_create_u32("llc_aggr_tolerance", 0644, debugfs_sched,
+ &llc_aggr_tolerance);
+ debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
+ &llc_epoch_period);
+ debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
+ &llc_epoch_affinity_timeout);
+ debugfs_create_u32("llc_overaggr_pct", 0644, debugfs_sched,
+ &llc_overaggr_pct);
+ debugfs_create_u32("llc_imb_pct", 0644, debugfs_sched,
+ &llc_imb_pct);
#endif
debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee4982af2bdd..da4291ace24c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1191,6 +1191,12 @@ static void set_next_buddy(struct sched_entity *se);
#define EPOCH_PERIOD (HZ / 100) /* 10 ms */
#define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */
+__read_mostly unsigned int llc_aggr_tolerance = 1;
+__read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
+__read_mostly unsigned int llc_imb_pct = 20;
+__read_mostly unsigned int llc_overaggr_pct = 50;
+
static int llc_id(int cpu)
{
if (cpu < 0)
@@ -1223,10 +1229,22 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
return valid_llc_id(id);
}
+static inline int get_sched_cache_scale(int mul)
+{
+ if (!llc_aggr_tolerance)
+ return 0;
+
+ if (llc_aggr_tolerance >= 100)
+ return INT_MAX;
+
+ return (1 + (llc_aggr_tolerance - 1) * mul);
+}
+
static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
{
struct cacheinfo *ci;
u64 rss, llc;
+ int scale;
/*
* get_cpu_cacheinfo_level() can not be used
@@ -1251,20 +1269,47 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
rss = get_mm_counter(mm, MM_ANONPAGES) +
get_mm_counter(mm, MM_SHMEMPAGES);
- return (llc <= (rss * PAGE_SIZE));
+ /*
+ * Scale the LLC size by 256*llc_aggr_tolerance
+ * and compare it to the task's RSS size.
+ *
+ * Suppose the L3 size is 32MB. If the
+ * llc_aggr_tolerance is 1:
+ * When the RSS is larger than 32MB, the process
+ * is regarded as exceeding the LLC capacity. If
+ * the llc_aggr_tolerance is 99:
+ * When the RSS is larger than 784GB, the process
+ * is regarded as exceeding the LLC capacity:
+ * 784GB = (1 + (99 - 1) * 256) * 32MB
+ * If the llc_aggr_tolerance is 100:
+ * ignore the RSS.
+ */
+ scale = get_sched_cache_scale(256);
+ if (scale == INT_MAX)
+ return false;
+
+ return ((llc * scale) <= (rss * PAGE_SIZE));
}
static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
{
- int smt_nr = 1;
+ int smt_nr = 1, scale;
#ifdef CONFIG_SCHED_SMT
if (sched_smt_active())
smt_nr = cpumask_weight(cpu_smt_mask(cpu));
#endif
+ /*
+ * Scale the number of 'cores' in a LLC by llc_aggr_tolerance
+ * and compare it to the task's active threads.
+ */
+ scale = get_sched_cache_scale(1);
+ if (scale == INT_MAX)
+ return false;
+
return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
- per_cpu(sd_llc_size, cpu));
+ (scale * per_cpu(sd_llc_size, cpu)));
}
static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1365,7 +1410,7 @@ static inline void __update_mm_sched(struct rq *rq,
long delta = now - rq->cpu_epoch_next;
if (delta > 0) {
- n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+ n = (delta + llc_epoch_period - 1) / llc_epoch_period;
rq->cpu_epoch += n;
rq->cpu_epoch_next += n * EPOCH_PERIOD;
__shr_u64(&rq->cpu_runtime, n);
@@ -1460,7 +1505,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
* has only 1 thread, invalidate its preferred state.
*/
if (time_after(epoch,
- READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
+ READ_ONCE(mm->sc_stat.epoch) + llc_epoch_affinity_timeout) ||
get_nr_threads(p) <= 1 ||
exceed_llc_nr(mm, cpu_of(rq)) ||
exceed_llc_capacity(mm, cpu_of(rq))) {
@@ -9920,7 +9965,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
* (default: ~50%)
*/
#define fits_llc_capacity(util, max) \
- ((util) * 2 < (max))
+ ((util) * 100 < (max) * llc_overaggr_pct)
/*
* The margin used when comparing utilization.
@@ -9930,7 +9975,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
*/
/* Allows dst util to be bigger than src util by up to bias percent */
#define util_greater(util1, util2) \
- ((util1) * 100 > (util2) * 120)
+ ((util1) * 100 > (util2) * (100 + llc_imb_pct))
/* Called from load balancing paths with rcu_read_lock held */
static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adf3428745dd..f4785f84b1f1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3919,6 +3919,11 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
DECLARE_STATIC_KEY_FALSE(sched_cache_present);
DECLARE_STATIC_KEY_FALSE(sched_cache_active);
extern int max_llcs, sysctl_sched_cache_user;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_imb_pct;
+extern unsigned int llc_overaggr_pct;
static inline bool sched_cache_enabled(void)
{
--
2.32.0
* [PATCH v3 20/21] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs
From: Tim Chen @ 2026-02-10 22:19 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Debug patch only.
Show the per-LLC occupancy in /proc/{PID}/schedstat, with each column
corresponding to one LLC. This can be used to verify if the cache-aware
load balancer works as expected by aggregating threads onto dedicated LLCs.
Suppose there are 2 LLCs and the sampling duration is 10 seconds:
With cache aware load balance enabled:
0 12281 <--- LLC0 residency delta is 0, LLC1 is 12 seconds
0 18881
0 16217
With cache aware load balance disabled:
6497 15802
9299 5435
17811 8278
Co-developed-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Enhance the informational output by printing the task's
preferred LLC. (Aaron Lu)
fs/proc/base.c | 31 +++++++++++++++++++++++++
include/linux/mm_types.h | 17 +++++++++++---
include/linux/sched.h | 6 +++++
kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++++++++++----
4 files changed, 97 insertions(+), 7 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 4eec684baca9..76b49e80af1a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -518,6 +518,37 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
(unsigned long long)task->se.sum_exec_runtime,
(unsigned long long)task->sched_info.run_delay,
task->sched_info.pcount);
+#ifdef CONFIG_SCHED_CACHE
+ if (sched_cache_inuse()) {
+ struct mm_struct *mm = task->mm;
+ u64 *llc_runtime;
+ int mm_sched_llc;
+
+ if (!mm)
+ return 0;
+
+ llc_runtime = kcalloc(max_llcs, sizeof(u64), GFP_KERNEL);
+ if (!llc_runtime)
+ return 0;
+
+ if (get_mm_per_llc_runtime(task, llc_runtime))
+ goto out;
+
+ if (mm->sc_stat.cpu == -1)
+ mm_sched_llc = -1;
+ else
+ mm_sched_llc = llc_id(mm->sc_stat.cpu);
+
+ for (int i = 0; i < max_llcs; i++)
+ seq_printf(m, "%s%s%llu ",
+ i == task->preferred_llc ? "*" : "",
+ i == mm_sched_llc ? "?" : "",
+ llc_runtime[i]);
+ seq_puts(m, "\n");
+out:
+ kfree(llc_runtime);
+ }
+#endif
return 0;
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 777a48523aa6..2b8d0ec032e8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1523,17 +1523,26 @@ static inline unsigned int mm_cid_size(void)
#ifdef CONFIG_SCHED_CACHE
void mm_init_sched(struct mm_struct *mm,
- struct sched_cache_time __percpu *pcpu_sched);
+ struct sched_cache_time __percpu *pcpu_sched,
+ struct sched_cache_time __percpu *pcpu_time);
static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
{
struct sched_cache_time __percpu *pcpu_sched =
- alloc_percpu_noprof(struct sched_cache_time);
+ alloc_percpu_noprof(struct sched_cache_time),
+ *pcpu_time;
if (!pcpu_sched)
return -ENOMEM;
- mm_init_sched(mm, pcpu_sched);
+ pcpu_time = alloc_percpu_noprof(struct sched_cache_time);
+ if (!pcpu_time) {
+ free_percpu(pcpu_sched);
+ return -ENOMEM;
+ }
+
+ mm_init_sched(mm, pcpu_sched, pcpu_time);
+
return 0;
}
@@ -1542,7 +1551,9 @@ static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
static inline void mm_destroy_sched(struct mm_struct *mm)
{
free_percpu(mm->sc_stat.pcpu_sched);
+ free_percpu(mm->sc_stat.pcpu_time);
mm->sc_stat.pcpu_sched = NULL;
+ mm->sc_stat.pcpu_time = NULL;
}
#else /* !CONFIG_SCHED_CACHE */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 511c9b263386..4236cacbb409 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2344,12 +2344,18 @@ struct sched_cache_time {
struct sched_cache_stat {
struct sched_cache_time __percpu *pcpu_sched;
+ struct sched_cache_time __percpu *pcpu_time;
raw_spinlock_t lock;
unsigned long epoch;
u64 nr_running_avg;
int cpu;
} ____cacheline_aligned_in_smp;
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf);
+bool sched_cache_inuse(void);
+extern int max_llcs;
+int llc_id(int cpu);
+
#else
struct sched_cache_stat { };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da4291ace24c..25cee3dd767c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1197,7 +1197,12 @@ __read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEO
__read_mostly unsigned int llc_imb_pct = 20;
__read_mostly unsigned int llc_overaggr_pct = 50;
-static int llc_id(int cpu)
+bool sched_cache_inuse(void)
+{
+ return sched_cache_enabled();
+}
+
+int llc_id(int cpu)
{
if (cpu < 0)
return -1;
@@ -1365,17 +1370,20 @@ static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
}
void mm_init_sched(struct mm_struct *mm,
- struct sched_cache_time __percpu *_pcpu_sched)
+ struct sched_cache_time __percpu *_pcpu_sched,
+ struct sched_cache_time __percpu *_pcpu_time)
{
unsigned long epoch;
int i;
for_each_possible_cpu(i) {
struct sched_cache_time *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+ struct sched_cache_time *pcpu_time = per_cpu_ptr(_pcpu_time, i);
struct rq *rq = cpu_rq(i);
pcpu_sched->runtime = 0;
pcpu_sched->epoch = rq->cpu_epoch;
+ pcpu_time->runtime = 0;
epoch = rq->cpu_epoch;
}
@@ -1389,6 +1397,8 @@ void mm_init_sched(struct mm_struct *mm,
* the readers may get invalid mm_sched_epoch, etc.
*/
smp_store_release(&mm->sc_stat.pcpu_sched, _pcpu_sched);
+ /* Publish pcpu_time with the same release ordering as pcpu_sched above. */
+ smp_store_release(&mm->sc_stat.pcpu_time, _pcpu_time);
}
/* because why would C be fully specified */
@@ -1474,7 +1484,8 @@ static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
static inline
void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
{
- struct sched_cache_time *pcpu_sched;
+ struct sched_cache_time *pcpu_sched,
+ *pcpu_time;
struct mm_struct *mm = p->mm;
int mm_sched_llc = -1;
unsigned long epoch;
@@ -1488,14 +1499,18 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
* init_task, kthreads and user thread created
* by user_mode_thread() don't have mm.
*/
- if (!mm || !mm->sc_stat.pcpu_sched)
+ if (!mm || !mm->sc_stat.pcpu_sched ||
+ !mm->sc_stat.pcpu_time)
return;
pcpu_sched = per_cpu_ptr(p->mm->sc_stat.pcpu_sched, cpu_of(rq));
+ pcpu_time = per_cpu_ptr(p->mm->sc_stat.pcpu_time, cpu_of(rq));
scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
__update_mm_sched(rq, pcpu_sched);
pcpu_sched->runtime += delta_exec;
+ /* pure runtime without decay */
+ pcpu_time->runtime += delta_exec;
rq->cpu_runtime += delta_exec;
epoch = rq->cpu_epoch;
}
@@ -1676,6 +1691,33 @@ void init_sched_mm(struct task_struct *p)
work->next = work;
}
+/* Caller must hold p->pi_lock. */
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf)
+{
+ struct sched_cache_time *pcpu_time;
+ struct mm_struct *mm = p->mm;
+ int cpu;
+
+ if (!mm)
+ return -EINVAL;
+
+ rcu_read_lock();
+ for_each_online_cpu(cpu) {
+ int llc = llc_id(cpu);
+ u64 runtime_ms;
+
+ if (!valid_llc_id(llc))
+ continue;
+
+ pcpu_time = per_cpu_ptr(mm->sc_stat.pcpu_time, cpu);
+ runtime_ms = div_u64(pcpu_time->runtime, NSEC_PER_MSEC);
+ buf[llc] += runtime_ms;
+ }
+ rcu_read_unlock();
+
+ return 0;
+}
+
#else
static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
--
2.32.0
^ permalink raw reply related [flat|nested] 117+ messages in thread
* [PATCH v3 21/21] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (19 preceding siblings ...)
2026-02-10 22:19 ` [PATCH v3 20/21] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Tim Chen
@ 2026-02-10 22:19 ` Tim Chen
2026-02-19 14:08 ` [PATCH v3 00/21] Cache Aware Scheduling Qais Yousef
21 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-10 22:19 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Debug patch only.
Users can leverage these trace events (via bpftrace, etc.)
to monitor cache-aware load balancing activity - specifically,
whether tasks are moved to their preferred LLC, moved out of their
preferred LLC, or whether cache-aware load balancing is skipped
because the memory footprint limit is exceeded or there are too
many active tasks.
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
Notes:
v2->v3:
Add more trace events for when the process exceeds the LLC
size limit or the number of active threads (moved from schedstat
to trace events for better bpf tracking)
include/trace/events/sched.h | 79 ++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 40 ++++++++++++++----
2 files changed, 110 insertions(+), 9 deletions(-)
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..b73327653e4b 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -10,6 +10,85 @@
#include <linux/tracepoint.h>
#include <linux/binfmts.h>
+#ifdef CONFIG_SCHED_CACHE
+TRACE_EVENT(sched_exceed_llc_cap,
+
+ TP_PROTO(struct task_struct *t, int exceeded),
+
+ TP_ARGS(t, exceeded),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( int, exceeded )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+ __entry->pid = t->pid;
+ __entry->exceeded = exceeded;
+ ),
+
+ TP_printk("comm=%s pid=%d exceed_cap=%d",
+ __entry->comm, __entry->pid,
+ __entry->exceeded)
+);
+
+TRACE_EVENT(sched_exceed_llc_nr,
+
+ TP_PROTO(struct task_struct *t, int exceeded),
+
+ TP_ARGS(t, exceeded),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( int, exceeded )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+ __entry->pid = t->pid;
+ __entry->exceeded = exceeded;
+ ),
+
+ TP_printk("comm=%s pid=%d exceed_nr=%d",
+ __entry->comm, __entry->pid,
+ __entry->exceeded)
+);
+
+TRACE_EVENT(sched_attach_task,
+
+ TP_PROTO(struct task_struct *t, int pref_cpu, int pref_llc,
+ int attach_cpu, int attach_llc),
+
+ TP_ARGS(t, pref_cpu, pref_llc, attach_cpu, attach_llc),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ __field( int, pref_cpu )
+ __field( int, pref_llc )
+ __field( int, attach_cpu )
+ __field( int, attach_llc )
+ ),
+
+ TP_fast_assign(
+ memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+ __entry->pid = t->pid;
+ __entry->pref_cpu = pref_cpu;
+ __entry->pref_llc = pref_llc;
+ __entry->attach_cpu = attach_cpu;
+ __entry->attach_llc = attach_llc;
+ ),
+
+ TP_printk("comm=%s pid=%d pref_cpu=%d pref_llc=%d attach_cpu=%d attach_llc=%d",
+ __entry->comm, __entry->pid,
+ __entry->pref_cpu, __entry->pref_llc,
+ __entry->attach_cpu, __entry->attach_llc)
+);
+#endif
+
/*
* Tracepoint for calling kthread_stop, performed to end a kthread:
*/
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25cee3dd767c..977091fd0e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1245,9 +1245,11 @@ static inline int get_sched_cache_scale(int mul)
return (1 + (llc_aggr_tolerance - 1) * mul);
}
-static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu,
+ struct task_struct *p)
{
struct cacheinfo *ci;
+ bool exceeded;
u64 rss, llc;
int scale;
@@ -1293,12 +1295,18 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
if (scale == INT_MAX)
return false;
- return ((llc * scale) <= (rss * PAGE_SIZE));
+ exceeded = ((llc * scale) <= (rss * PAGE_SIZE));
+
+ trace_sched_exceed_llc_cap(p, exceeded);
+
+ return exceeded;
}
-static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+static bool exceed_llc_nr(struct mm_struct *mm, int cpu,
+ struct task_struct *p)
{
int smt_nr = 1, scale;
+ bool exceeded;
#ifdef CONFIG_SCHED_SMT
if (sched_smt_active())
@@ -1313,8 +1321,12 @@ static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
if (scale == INT_MAX)
return false;
- return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
+ exceeded = !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
(scale * per_cpu(sd_llc_size, cpu)));
+
+ trace_sched_exceed_llc_nr(p, exceeded);
+
+ return exceeded;
}
static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1522,8 +1534,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
if (time_after(epoch,
READ_ONCE(mm->sc_stat.epoch) + llc_epoch_affinity_timeout) ||
get_nr_threads(p) <= 1 ||
- exceed_llc_nr(mm, cpu_of(rq)) ||
- exceed_llc_capacity(mm, cpu_of(rq))) {
+ exceed_llc_nr(mm, cpu_of(rq), p) ||
+ exceed_llc_capacity(mm, cpu_of(rq), p)) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
}
@@ -1600,7 +1612,7 @@ static void task_cache_work(struct callback_head *work)
curr_cpu = task_cpu(p);
if (get_nr_threads(p) <= 1 ||
- exceed_llc_capacity(mm, curr_cpu)) {
+ exceed_llc_capacity(mm, curr_cpu, p)) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
@@ -10159,8 +10171,8 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
* Skip cache aware load balance for single/too many threads
* or large memory RSS.
*/
- if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
- exceed_llc_capacity(mm, dst_cpu)) {
+ if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu, p) ||
+ exceed_llc_capacity(mm, dst_cpu, p)) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
return mig_unrestricted;
@@ -10602,6 +10614,16 @@ static void attach_task(struct rq *rq, struct task_struct *p)
{
lockdep_assert_rq_held(rq);
+#ifdef CONFIG_SCHED_CACHE
+ if (p->mm) {
+ int pref_cpu = p->mm->sc_stat.cpu;
+
+ trace_sched_attach_task(p,
+ pref_cpu,
+ pref_cpu != -1 ? llc_id(pref_cpu) : -1,
+ cpu_of(rq), llc_id(cpu_of(rq)));
+ }
+#endif
WARN_ON_ONCE(task_rq(p) != rq);
activate_task(rq, p, ENQUEUE_NOCLOCK);
wakeup_preempt(rq, p, 0);
--
2.32.0
* Re: [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing
2026-02-10 22:18 ` [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
@ 2026-02-14 12:26 ` Madadi Vineeth Reddy
2026-02-14 15:34 ` Chen, Yu C
0 siblings, 1 reply; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-14 12:26 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu,
Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef,
Libo Chen, linux-kernel, Madadi Vineeth Reddy
Hi Tim,
Thanks for the patch series.
On 11/02/26 03:48, Tim Chen wrote:
> From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>
> Adds infrastructure to enable cache-aware load balancing,
> which improves cache locality by grouping tasks that share resources
> within the same cache domain. This reduces cache misses and improves
> overall data access efficiency.
[..snip..]
> +void mm_init_sched(struct mm_struct *mm,
> + struct sched_cache_time __percpu *_pcpu_sched)
> +{
> + unsigned long epoch;
> + int i;
> +
> + for_each_possible_cpu(i) {
> + struct sched_cache_time *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
> + struct rq *rq = cpu_rq(i);
> +
> + pcpu_sched->runtime = 0;
> + pcpu_sched->epoch = rq->cpu_epoch;
> + epoch = rq->cpu_epoch;
Shouldn't cpu_epoch be read under cpu_epoch_lock, similar to how fraction_mm_sched()
and __update_mm_sched() acquire the lock before accessing this field?
Thanks,
Vineeth
> + }
> +
> + raw_spin_lock_init(&mm->sc_stat.lock);
> + mm->sc_stat.epoch = epoch;
> + mm->sc_stat.cpu = -1;
> +
> + /*
> + * The update to mm->sc_stat should not be reordered
> + * before initialization to mm's other fields, in case
> + * the readers may get invalid mm_sched_epoch, etc.
> + */
> + smp_store_release(&mm->sc_stat.pcpu_sched, _pcpu_sched);
> +}
[..snip..]
* Re: [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing
2026-02-14 12:26 ` Madadi Vineeth Reddy
@ 2026-02-14 15:34 ` Chen, Yu C
2026-02-17 18:51 ` Tim Chen
0 siblings, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-14 15:34 UTC (permalink / raw)
To: Madadi Vineeth Reddy, Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel
Hi Vineeth,
On 2/14/2026 8:26 PM, Madadi Vineeth Reddy wrote:
> Hi Tim,
> Thanks for the patch series.
>
> On 11/02/26 03:48, Tim Chen wrote:
>> From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>>
>> Adds infrastructure to enable cache-aware load balancing,
>> which improves cache locality by grouping tasks that share resources
>> within the same cache domain. This reduces cache misses and improves
>> overall data access efficiency.
>
> [..snip..]
>
>> +void mm_init_sched(struct mm_struct *mm,
>> + struct sched_cache_time __percpu *_pcpu_sched)
>> +{
>> + unsigned long epoch;
>> + int i;
>> +
>> + for_each_possible_cpu(i) {
>> + struct sched_cache_time *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
>> + struct rq *rq = cpu_rq(i);
>> +
>> + pcpu_sched->runtime = 0;
>> + pcpu_sched->epoch = rq->cpu_epoch;
>> + epoch = rq->cpu_epoch;
>
> Shouldn't cpu_epoch be read under cpu_epoch_lock, similar to how fraction_mm_sched()
> and __update_mm_sched() acquire the lock before accessing this field?
My understanding is that __update_mm_sched() updates rq->cpu_epoch
in two steps: first, it reads the current value, and then it writes
the new value back (as seen in the operation rq->cpu_epoch += n).
For this reason, a lock is required to prevent race conditions
during concurrent updates across multiple CPUs.
In contrast, reading rq->cpu_epoch in mm_init_sched() is a single
atomic operation, and it is acceptable to read a stale value in
this scenario - thus, we can safely perform an unprotected read of
this field here.
thanks,
Chenyu
>
> Thanks,
> Vineeth
* Re: [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy
2026-02-10 22:18 ` [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
@ 2026-02-14 16:12 ` Madadi Vineeth Reddy
2026-02-15 12:14 ` Chen, Yu C
2026-02-19 11:29 ` Peter Zijlstra
1 sibling, 1 reply; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-14 16:12 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Madadi Vineeth Reddy
On 11/02/26 03:48, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
>
> Cache-aware scheduling aggregates threads onto their preferred LLC,
> mainly through load balancing. When the preferred LLC becomes
> saturated, more threads are still placed there, increasing latency.
> A mechanism is needed to limit aggregation so that the preferred LLC
> does not become overloaded.
>
> Introduce helper functions can_migrate_llc() and
> can_migrate_llc_task() to enforce the LLC migration policy:
>
> 1. Aggregate a task to its preferred LLC if both source and
> destination LLCs are not too busy, or if doing so will not
> leave the preferred LLC much more imbalanced than the
> non-preferred one (>20% utilization difference, a little
> higher than imbalance_pct(17%) of the LLC domain as hysteresis).
> 2. Allow moving a task from overloaded preferred LLC to a non
> preferred LLC if this will not cause the non preferred LLC
> to become too imbalanced to cause a later migration back.
> 3. If both LLCs are too busy, let the generic load balance to
> spread the tasks.
>
> Further (hysteresis) action could be taken in the future to prevent tasks
> from being migrated into and out of the preferred LLC frequently (back and
> forth): the threshold for migrating a task out of its preferred LLC should
> be higher than that for migrating it into the LLC.
>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>
> Notes:
> v2->v3:
> No change.
>
> kernel/sched/fair.c | 153 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 153 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dfeb107f2cfd..bf5f39a01017 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9689,6 +9689,27 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
> }
>
> #ifdef CONFIG_SCHED_CACHE
> +/*
> + * The margin used when comparing LLC utilization with CPU capacity.
> + * It determines the LLC load level where active LLC aggregation is
> + * done.
> + * Derived from fits_capacity().
> + *
> + * (default: ~50%)
> + */
> +#define fits_llc_capacity(util, max) \
> + ((util) * 2 < (max))
> +
> +/*
> + * The margin used when comparing utilization:
> + * is 'util1' noticeably greater than 'util2'?
> + * Derived from capacity_greater().
> + * Bias is in percentage.
> + */
> +/* Allows dst util to be bigger than src util by up to bias percent */
> +#define util_greater(util1, util2) \
> + ((util1) * 100 > (util2) * 120)
> +
> /* Called from load balancing paths with rcu_read_lock held */
> static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
> unsigned long *cap)
> @@ -9704,6 +9725,138 @@ static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
>
> return true;
> }
> +
> +/*
> + * Decision matrix according to the LLC utilization. To
> + * decide whether we can do task aggregation across LLC.
> + *
> + * By default, 50% is the threshold for treating the LLC
> + * as busy. The reason for choosing 50% is to avoid saturation
> + * of SMT-2, and it is also a safe cutoff for other SMT-n
> + * platforms.
> + *
> + * 20% is the utilization imbalance percentage to decide
> + * if the preferred LLC is busier than the non-preferred LLC.
> + * 20 is a little higher than the LLC domain's imbalance_pct
> + * 17. The hysteresis is used to avoid task bouncing between the
> + * preferred LLC and the non-preferred LLC.
> + *
> + * 1. moving towards the preferred LLC, dst is the preferred
> + * LLC, src is not.
> + *
> + * src \ dst 30% 40% 50% 60%
> + * 30% Y Y Y N
> + * 40% Y Y Y Y
> + * 50% Y Y G G
> + * 60% Y Y G G
> + *
According to this matrix (which I assume shows utilization after migration),
G is expected for src=50% and dst=50%. However, the code performs the "both
busy" check before adjusting src_util and dst_util:
if (!fits_llc_capacity(dst_util, dst_cap) &&
!fits_llc_capacity(src_util, src_cap))
return mig_unrestricted;
src_util = src_util - tsk_util;
dst_util = dst_util + tsk_util;
For example, with a 10% task migrating from src_util=60% to dst_util=40%:
The check evaluates: !fits(40) && !fits(60) = false && true = false
- Doesn't return mig_unrestricted
- After adjustment: src=50%, dst=50%
- Falls through to return mig_llc (Y)
But the matrix indicates 50%/50% should be G, not Y.
Moving this check after the utilization adjustment would make it consistent
with the documented matrix.
Thanks,
Vineeth
> + * 2. moving out of the preferred LLC, src is the preferred
> + * LLC, dst is not:
> + *
> + * src \ dst 30% 40% 50% 60%
> + * 30% N N N N
> + * 40% N N N N
> + * 50% N N G G
> + * 60% Y N G G
> + *
> + * src : src_util
> + * dst : dst_util
> + * Y : Yes, migrate
> + * N : No, do not migrate
> + * G : let the Generic load balance to even the load.
> + *
> + * The intention is that if both LLCs are quite busy, cache aware
> + * load balance should not be performed, and generic load balance
> + * should take effect. However, if one is busy and the other is not,
> + * the preferred LLC capacity(50%) and imbalance criteria(20%) should
> + * be considered to determine whether LLC aggregation should be
> + * performed to bias the load towards the preferred LLC.
> + */
> +
> +/* migration decision, 3 states are orthogonal. */
> +enum llc_mig {
> + mig_forbid = 0, /* N: Don't migrate task, respect LLC preference */
> + mig_llc, /* Y: Do LLC preference based migration */
> + mig_unrestricted /* G: Don't restrict generic load balance migration */
> +};
> +
> +/*
> + * Check if task can be moved from the source LLC to the
> + * destination LLC without breaking cache aware preference.
> + * src_cpu and dst_cpu are arbitrary CPUs within the source
> + * and destination LLCs, respectively.
> + */
> +static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
> + unsigned long tsk_util,
> + bool to_pref)
> +{
> + unsigned long src_util, dst_util, src_cap, dst_cap;
> +
> + if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
> + !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
> + return mig_unrestricted;
> +
> + if (!fits_llc_capacity(dst_util, dst_cap) &&
> + !fits_llc_capacity(src_util, src_cap))
> + return mig_unrestricted;
> +
> + src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
> + dst_util = dst_util + tsk_util;
> + if (to_pref) {
> + /*
> + * Don't migrate if we will get preferred LLC too
> + * heavily loaded and if the dest is much busier
> + * than the src, in which case migration will
> + * increase the imbalance too much.
> + */
> + if (!fits_llc_capacity(dst_util, dst_cap) &&
> + util_greater(dst_util, src_util))
> + return mig_forbid;
> + } else {
> + /*
> + * Don't migrate if we will leave preferred LLC
> + * too idle, or if this migration leads to the
> + * non-preferred LLC falling within sysctl_aggr_imb percent
> + * of preferred LLC, leading to migration again
> + * back to preferred LLC.
> + */
> + if (fits_llc_capacity(src_util, src_cap) ||
> + !util_greater(src_util, dst_util))
> + return mig_forbid;
> + }
> + return mig_llc;
> +}
> +
> +/*
> + * Check if task p can migrate from source LLC to
> + * destination LLC in terms of cache aware load balance.
> + */
> +static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
> + struct task_struct *p)
> +{
> + struct mm_struct *mm;
> + bool to_pref;
> + int cpu;
> +
> + mm = p->mm;
> + if (!mm)
> + return mig_unrestricted;
> +
> + cpu = mm->sc_stat.cpu;
> + if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
> + return mig_unrestricted;
> +
> + if (cpus_share_cache(dst_cpu, cpu))
> + to_pref = true;
> + else if (cpus_share_cache(src_cpu, cpu))
> + to_pref = false;
> + else
> + return mig_unrestricted;
> +
> + return can_migrate_llc(src_cpu, dst_cpu,
> + task_util(p), to_pref);
> +}
> +
> #else
> static inline bool get_llc_stats(int cpu, unsigned long *util,
> unsigned long *cap)
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-10 22:18 ` [PATCH v3 04/21] sched/cache: Make LLC id continuous Tim Chen
@ 2026-02-14 17:53 ` Madadi Vineeth Reddy
2026-02-15 14:25 ` Chen, Yu C
2026-02-16 7:44 ` K Prateek Nayak
` (2 subsequent siblings)
3 siblings, 1 reply; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-14 17:53 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Madadi Vineeth Reddy
On 11/02/26 03:48, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
>
> Introduce an index mapping between CPUs and their LLCs. This provides
> a continuous per LLC index needed for cache-aware load balancing in
> later patches.
>
> The existing per_cpu llc_id usually points to the first CPU of the
> LLC domain, which is sparse and unsuitable as an array index. Using
> llc_id directly would waste memory.
>
> With the new mapping, CPUs in the same LLC share a continuous id:
>
> per_cpu(llc_id, CPU=0...15) = 0
> per_cpu(llc_id, CPU=16...31) = 1
> per_cpu(llc_id, CPU=32...47) = 2
> ...
>
> Once a CPU has been assigned an llc_id, this ID persists even when
> the CPU is taken offline and brought back online, which can facilitate
> the management of the ID.
tl_max_llcs is never reset across multiple invocations of build_sched_domains().
While this preserves LLC IDs across normal CPU hotplug events, I'm wondering about
scenarios where hardware topology changes, such as physically removing/replacing
CPU sockets.
Example scenario:
Boot with 3 LLCs: IDs {0,1,2}, tl_max_llcs=3
Physical hardware change removes LLC 1
New hardware added at a different position gets ID=3
After multiple such events: System has 4 LLCs but IDs {0,2,5,7}, tl_max_llcs=8
This creates gaps in the ID space. However, I understand this trade-off might be
intentional since physical topology changes are rare, and resetting tl_max_llcs and
all sd_llc_id values would rebuild IDs on every invocation of build_sched_domains().
I'd like to know your thoughts on the overhead of resetting tl_max_llcs and sd_llc_id
so that IDs are rebuilt on each invocation of build_sched_domains() to always maintain
a dense mapping.
Thanks,
Vineeth
>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>
> Notes:
> v2->v3:
> Allocate the LLC id according to the topology level data directly, rather
> than calculating from the sched domain. This simplifies the code.
> (Peter Zijlstra, K Prateek Nayak)
>
> kernel/sched/topology.c | 47 ++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 44 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index cf643a5ddedd..ca46b5cf7f78 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
> /* Protected by sched_domains_mutex: */
> static cpumask_var_t sched_domains_tmpmask;
> static cpumask_var_t sched_domains_tmpmask2;
> +static int tl_max_llcs;
>
> static int __init sched_debug_setup(char *str)
> {
> @@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
> */
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> -DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
> DEFINE_PER_CPU(int, sd_share_id);
> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
>
> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> per_cpu(sd_llc_size, cpu) = size;
> - per_cpu(sd_llc_id, cpu) = id;
> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> sd = lowest_flag_domain(cpu, SD_CLUSTER);
> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>
> /* Set up domains for CPUs specified by the cpu_map: */
> for_each_cpu(i, cpu_map) {
> - struct sched_domain_topology_level *tl;
> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
> + int lid;
>
> sd = NULL;
> for_each_sd_topology(tl) {
> + int flags = 0;
> +
> + if (tl->sd_flags)
> + flags = (*tl->sd_flags)();
> +
> + if (flags & SD_SHARE_LLC)
> + tl_llc = tl;
>
> sd = build_sched_domain(tl, cpu_map, attr, sd, i);
>
> @@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> if (cpumask_equal(cpu_map, sched_domain_span(sd)))
> break;
> }
> +
> + lid = per_cpu(sd_llc_id, i);
> + if (lid == -1) {
> + int j;
> +
> + /*
> + * Assign the llc_id to the CPUs that do not
> + * have an LLC.
> + */
> + if (!tl_llc) {
> + per_cpu(sd_llc_id, i) = tl_max_llcs++;
> +
> + continue;
> + }
> +
> + /* try to reuse the llc_id of its siblings */
> + for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
> + if (i == j)
> + continue;
> +
> + lid = per_cpu(sd_llc_id, j);
> +
> + if (lid != -1) {
> + per_cpu(sd_llc_id, i) = lid;
> +
> + break;
> + }
> + }
> +
> + /* a new LLC is detected */
> + if (lid == -1)
> + per_cpu(sd_llc_id, i) = tl_max_llcs++;
> + }
> }
>
> if (WARN_ON(!topology_span_sane(cpu_map)))
* Re: [PATCH v3 05/21] sched/cache: Assign preferred LLC ID to processes
2026-02-10 22:18 ` [PATCH v3 05/21] sched/cache: Assign preferred LLC ID to processes Tim Chen
@ 2026-02-14 18:36 ` Madadi Vineeth Reddy
2026-02-16 6:58 ` Chen, Yu C
0 siblings, 1 reply; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-14 18:36 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu,
Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef,
Libo Chen, linux-kernel, Madadi Vineeth Reddy
On 11/02/26 03:48, Tim Chen wrote:
> With cache-aware scheduling enabled, each task is assigned a
> preferred LLC ID. This allows quick identification of the LLC domain
> where the task prefers to run, similar to numa_preferred_nid in
> NUMA balancing.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>
> Notes:
> v2->v3:
> Add comments around code handling NUMA balance conflict with cache aware
> scheduling. (Peter Zijlstra)
>
> Check if NUMA balancing is disabled before checking numa_preferred_nid
> (Jianyong Wu)
>
> include/linux/sched.h | 1 +
> init/init_task.c | 3 +++
> kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 46 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2817a21ee055..c98bd1c46088 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1411,6 +1411,7 @@ struct task_struct {
>
> #ifdef CONFIG_SCHED_CACHE
> struct callback_head cache_work;
> + int preferred_llc;
> #endif
>
> struct rseq_data rseq;
> diff --git a/init/init_task.c b/init/init_task.c
> index 49b13d7c3985..baa420de2644 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -218,6 +218,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
> .numa_group = NULL,
> .numa_faults = NULL,
> #endif
> +#ifdef CONFIG_SCHED_CACHE
> + .preferred_llc = -1,
> +#endif
> #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> .kasan_depth = 1,
> #endif
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bf5f39a01017..0b4ed0f2809d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1273,11 +1273,43 @@ static unsigned long fraction_mm_sched(struct rq *rq,
> return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
> }
>
> +static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
> +{
> + int mm_sched_llc = -1;
> +
> + if (!mm)
> + return -1;
> +
> + if (mm->sc_stat.cpu != -1) {
> + mm_sched_llc = llc_id(mm->sc_stat.cpu);
> +
> +#ifdef CONFIG_NUMA_BALANCING
> + /*
> + * Don't assign preferred LLC if it
> + * conflicts with NUMA balancing.
> + * This can happen when sched_setnuma() gets
> + * called, however it is not much of an issue
> + * because we expect account_mm_sched() to get
> + * called fairly regularly -- at a higher rate
> + * than sched_setnuma() at least -- and thus the
> + * conflict only exists for a short period of time.
> + */
> + if (static_branch_likely(&sched_numa_balancing) &&
> + p->numa_preferred_nid >= 0 &&
> + cpu_to_node(mm->sc_stat.cpu) != p->numa_preferred_nid)
> + mm_sched_llc = -1;
> +#endif
> + }
> +
> + return mm_sched_llc;
> +}
> +
> static inline
> void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> {
> struct sched_cache_time *pcpu_sched;
> struct mm_struct *mm = p->mm;
> + int mm_sched_llc = -1;
> unsigned long epoch;
>
> if (!sched_cache_enabled())
> @@ -1311,6 +1343,11 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> if (mm->sc_stat.cpu != -1)
> mm->sc_stat.cpu = -1;
> }
> +
> + mm_sched_llc = get_pref_llc(p, mm);
> +
> + if (p->preferred_llc != mm_sched_llc)
> + p->preferred_llc = mm_sched_llc;
This writes to p->preferred_llc without using WRITE_ONCE(). If later patches read p->preferred_llc from
load balancing or migration paths on other CPUs, wouldn't this create a data race?
For example:
CPU 0: Task is running, account_mm_sched() writes p->preferred_llc
CPU 1: Load balancer reads p->preferred_llc to make migration decisions
Should this use WRITE_ONCE() and READ_ONCE() at the read sites, unless all accesses are guaranteed to be
under rq->lock?
Thanks,
Vineeth
> }
>
> static void task_tick_cache(struct rq *rq, struct task_struct *p)
> @@ -1440,6 +1477,11 @@ void init_sched_mm(struct task_struct *p) { }
>
> static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
>
> +static inline int get_pref_llc(struct task_struct *p,
> + struct mm_struct *mm)
> +{
> + return -1;
> +}
> #endif
>
> /*
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy
2026-02-14 16:12 ` Madadi Vineeth Reddy
@ 2026-02-15 12:14 ` Chen, Yu C
0 siblings, 0 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-15 12:14 UTC (permalink / raw)
To: Madadi Vineeth Reddy, Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel
On 2/15/2026 12:12 AM, Madadi Vineeth Reddy wrote:
> On 11/02/26 03:48, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
[ ... ]
>
> According to this matrix (which I assume shows utilization after migration),
> G is expected for src=50% and dst=50%. However, the code performs the "both
> busy" check before adjusting src_util and dst_util:
>
> if (!fits_llc_capacity(dst_util, dst_cap) &&
> !fits_llc_capacity(src_util, src_cap))
> return mig_unrestricted;
>
> src_util = src_util - tsk_util;
> dst_util = dst_util + tsk_util;
>
> For example, with a 10% task migrating from src_util=60% to dst_util=40%:
>
> The check evaluates: !fits(40) && !fits(60) = false && true = false
> - Doesn't return mig_unrestricted
> - After adjustment: src=50%, dst=50%
> - Falls through to return mig_llc (Y)
>
> But the matrix indicates 50%/50% should be G, not Y.
>
> Moving this check after the utilization adjustment would make it consistent
> with the documented matrix.
>
Right, I previously considered only the snapshot of src/dst_util without
accounting for the migrating task. To keep this consistent with the
decision matrix, I will adjust the sequence as you suggested.
thanks,
Chenyu
> Thanks,
> Vineeth
>
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-14 17:53 ` Madadi Vineeth Reddy
@ 2026-02-15 14:25 ` Chen, Yu C
2026-02-17 10:05 ` Madadi Vineeth Reddy
0 siblings, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-15 14:25 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel, Tim Chen
On 2/15/2026 1:53 AM, Madadi Vineeth Reddy wrote:
> On 11/02/26 03:48, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> Introduce an index mapping between CPUs and their LLCs. This provides
>> a continuous per LLC index needed for cache-aware load balancing in
>> later patches.
>>
>> The existing per_cpu llc_id usually points to the first CPU of the
>> LLC domain, which is sparse and unsuitable as an array index. Using
>> llc_id directly would waste memory.
>>
>> With the new mapping, CPUs in the same LLC share a continuous id:
>>
>> per_cpu(llc_id, CPU=0...15) = 0
>> per_cpu(llc_id, CPU=16...31) = 1
>> per_cpu(llc_id, CPU=32...47) = 2
>> ...
>>
>> Once a CPU has been assigned an llc_id, this ID persists even when
>> the CPU is taken offline and brought back online, which can facilitate
>> the management of the ID.
>
> tl_max_llcs is never reset across multiple invocations of build_sched_domains().
> While this preserves LLC IDs across normal CPU hotplug events, I'm wondering about
> scenarios where hardware topology changes, such as physically removing/replacing
> CPU sockets.
>
> Example scenario:
> Boot with 3 LLCs: IDs {0,1,2}, tl_max_llcs=3
> Physical hardware change removes LLC 1
> New hardware added at a different position gets ID=3
> After multiple such events: System has 4 LLCs but IDs {0,2,5,7}, tl_max_llcs=8
>
I agree that keeping tl_max_llcs non-decreasing might waste some space. The
original motivation for introducing a dynamic sd_llc_id was mainly that a
static sd_llc_id[NR_LLC] is not suitable, as we cannot find a proper upper
limit for NR_LLC, and an arbitrary value for NR_LLC is unacceptable. That is
to say, tl_max_llcs serves as the historical maximum LLC index that has ever
been detected, much like a CPU id. It is possible that the number of
available LLCs shrinks due to CPUs going offline after boot-up; a value of
tl_max_llcs=8 indicates that this system once had 8 valid LLCs. On the
other hand, the dense mapping is a side effect of dynamically allocating
sd_llc_id.
> This creates gaps in the ID space. However, I understand this trade-off might be
> intentional since physical topology changes are rare, and resetting tl_max_llcs and
> all sd_llc_id values would rebuild IDs on every invocation of build_sched_domains().
>
> Would like to know your thoughts on overhead of resetting tl_max_llcs and sd_llc_id
> so that IDs are rebuilt on each invocation of build_sched_domains() to always maintain
> a dense mapping.
>
The current implementation is intentionally kept simple for easier review,
and I agree that strictly enforcing a dense mapping for sd_llc_id, by
recalculating the actual maximum LLC count (max_llcs) whenever the CPU
topology changes, could be an optimization direction once the basic version
has been accepted. I assume what you are suggesting is that we could reset
tl_max_llcs/max_llcs/sd_llc_id for the CPUs in doms_new[i] within
partition_sched_domains_locked(), and then rebuild these values in
build_sched_domains() accordingly. One risk here is a race condition when
modifying the llc_id of a specific CPU, but off the top of my head,
valid_llc_buf() should help prevent out-of-range access to sd->pf caused
by such races. Thoughts?
thanks,
Chenyu
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 05/21] sched/cache: Assign preferred LLC ID to processes
2026-02-14 18:36 ` Madadi Vineeth Reddy
@ 2026-02-16 6:58 ` Chen, Yu C
0 siblings, 0 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-16 6:58 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel, Tim Chen
On 2/15/2026 2:36 AM, Madadi Vineeth Reddy wrote:
> On 11/02/26 03:48, Tim Chen wrote:
[ ... ]
>> @@ -1311,6 +1343,11 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>> if (mm->sc_stat.cpu != -1)
>> mm->sc_stat.cpu = -1;
>> }
>> +
>> + mm_sched_llc = get_pref_llc(p, mm);
>> +
>> + if (p->preferred_llc != mm_sched_llc)
>> + p->preferred_llc = mm_sched_llc;
>
> This writes to p->preferred_llc without using WRITE_ONCE(). If later patches read p->preferred_llc from
> load balancing or migration paths on other CPUs, wouldn't this create a data race?
>
I suppose you are referring to data inconsistency between CPUs, as
READ/WRITE_ONCE() ensure the value is always read from/written to memory,
rather than cached in a register by the compiler.
> For example:
> CPU 0: Task is running, account_mm_sched() writes p->preferred_llc
> CPU 1: Load balancer reads p->preferred_llc to make migration decisions
>
> Should this use WRITE_ONCE() and READ_ONCE() at the read sites, unless all accesses are guaranteed to be
> under rq->lock?
>
Actually, I found that p->preferred_llc is only read during task
enqueue/dequeue in account_llc_enqueue() and account_llc_dequeue(), which
are protected by the rq lock. However, after reviewing the code again, I
noticed that migrate_degrades_llc() (part of the load balance logic) should
check the task's preferred_llc instead of the task's current LLC obtained
via task_llc(p). We will therefore switch to using READ/WRITE_ONCE() for
accesses to this variable in migrate_degrades_llc().
thanks,
Chenyu
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-10 22:18 ` [PATCH v3 04/21] sched/cache: Make LLC id continuous Tim Chen
2026-02-14 17:53 ` Madadi Vineeth Reddy
@ 2026-02-16 7:44 ` K Prateek Nayak
2026-02-17 6:07 ` Chen, Yu C
2026-02-19 15:40 ` Peter Zijlstra
2026-02-19 11:35 ` Peter Zijlstra
2026-02-19 14:59 ` Peter Zijlstra
3 siblings, 2 replies; 117+ messages in thread
From: K Prateek Nayak @ 2026-02-16 7:44 UTC (permalink / raw)
To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy,
Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
Hello Tim, Chenyu,
On 2/11/2026 3:48 AM, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
>
> Introduce an index mapping between CPUs and their LLCs. This provides
> a continuous per LLC index needed for cache-aware load balancing in
> later patches.
>
> The existing per_cpu llc_id usually points to the first CPU of the
> LLC domain, which is sparse and unsuitable as an array index. Using
> llc_id directly would waste memory.
>
> With the new mapping, CPUs in the same LLC share a continuous id:
>
> per_cpu(llc_id, CPU=0...15) = 0
> per_cpu(llc_id, CPU=16...31) = 1
> per_cpu(llc_id, CPU=32...47) = 2
> ...
>
> Once a CPU has been assigned an llc_id, this ID persists even when
> the CPU is taken offline and brought back online, which can facilitate
> the management of the ID.
>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>
> Notes:
> v2->v3:
> Allocate the LLC id according to the topology level data directly, rather
> than calculating from the sched domain. This simplifies the code.
> (Peter Zijlstra, K Prateek Nayak)
>
> kernel/sched/topology.c | 47 ++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 44 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index cf643a5ddedd..ca46b5cf7f78 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
> /* Protected by sched_domains_mutex: */
> static cpumask_var_t sched_domains_tmpmask;
> static cpumask_var_t sched_domains_tmpmask2;
> +static int tl_max_llcs;
>
> static int __init sched_debug_setup(char *str)
> {
> @@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
> */
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> -DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
> DEFINE_PER_CPU(int, sd_share_id);
> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
>
> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> per_cpu(sd_llc_size, cpu) = size;
> - per_cpu(sd_llc_id, cpu) = id;
> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> sd = lowest_flag_domain(cpu, SD_CLUSTER);
> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>
> /* Set up domains for CPUs specified by the cpu_map: */
> for_each_cpu(i, cpu_map) {
> - struct sched_domain_topology_level *tl;
> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
> + int lid;
>
> sd = NULL;
> for_each_sd_topology(tl) {
> + int flags = 0;
> +
> + if (tl->sd_flags)
> + flags = (*tl->sd_flags)();
> +
> + if (flags & SD_SHARE_LLC)
> + tl_llc = tl;
nit. This loop breaks out when sched_domain_span(sd) covers the entire
cpu_map, and it might not have reached the topmost SD_SHARE_LLC domain
yet. Is that cause for any concern?
>
> sd = build_sched_domain(tl, cpu_map, attr, sd, i);
>
> @@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> if (cpumask_equal(cpu_map, sched_domain_span(sd)))
> break;
> }
> +
> + lid = per_cpu(sd_llc_id, i);
> + if (lid == -1) {
> + int j;
> +
> + /*
> + * Assign the llc_id to the CPUs that do not
> + * have an LLC.
> + */
> + if (!tl_llc) {
> + per_cpu(sd_llc_id, i) = tl_max_llcs++;
> +
> + continue;
> + }
> +
> + /* try to reuse the llc_id of its siblings */
> + for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
My only large concern that remains is the fact that offline CPUs are
taken out of the tl->mask(), which can lead to interesting cases where
CPUs in the same LLC can have different llc_id:
o Boot with maxcpus=1
o Run:
for i in $(seq 1 "$NRCPUS"); do
echo 1 > /sys/devices/system/cpu/cpu$i/online;
echo 0 > /sys/devices/system/cpu/cpu$i/online;
done
o Finally run:
echo 1 | tee /sys/devices/system/cpu/cpu*/online;
Once all CPUs are online, only the CPUs in the boot CPU's LLC will have
the same llc_id. Every other CPU will have a unique llc_id, which might
make the system behave unexpectedly.
I'm wondering if we can do something like below on top of this patch:
(Only build tested; Prepared on top of this patch in Tim's tree)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c6efa71cf500..aee1be89ab4c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8268,6 +8268,8 @@ static void cpuset_cpu_active(void)
static void cpuset_cpu_inactive(unsigned int cpu)
{
if (!cpuhp_tasks_frozen) {
+ /* XXX: Is this the right spot? */
+ sched_domains_free_llc_id(cpu);
cpuset_update_active_cpus();
} else {
num_cpus_frozen++;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de5b701c3950..31a8910297c7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3903,6 +3903,7 @@ static inline bool sched_cache_enabled(void)
}
#endif
extern void init_sched_mm(struct task_struct *p);
+void sched_domains_free_llc_id(int cpu);
extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca46b5cf7f78..04c1ab489ee2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -18,6 +18,7 @@ void sched_domains_mutex_unlock(void)
}
/* Protected by sched_domains_mutex: */
+static cpumask_var_t sched_domains_llc_id_allocmask;
static cpumask_var_t sched_domains_tmpmask;
static cpumask_var_t sched_domains_tmpmask2;
static int tl_max_llcs;
@@ -2543,6 +2544,53 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
return true;
}
+static int __sched_domains_alloc_llc_id(void)
+{
+ int lid;
+
+ lockdep_assert_held(&sched_domains_mutex);
+
+ lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
+ if (lid >= tl_max_llcs)
+ tl_max_llcs++;
+
+ /*
+ * llc_id space should never grow larger than the
+ * possible number of CPUs in the system.
+ */
+ if (!WARN_ON_ONCE(lid >= nr_cpumask_bits))
+ cpumask_set_cpu(lid, sched_domains_llc_id_allocmask);
+ return lid;
+}
+
+static void __sched_domains_free_llc_id(int cpu)
+{
+ int i, lid;
+
+ lockdep_assert_held(&sched_domains_mutex);
+
+ lid = per_cpu(sd_llc_id, cpu);
+ if (lid == -1)
+ return;
+
+ per_cpu(sd_llc_id, cpu) = -1;
+
+ for_each_online_cpu(i) {
+ /* An online CPU owns the llc_id. */
+ if (per_cpu(sd_llc_id, i) == lid)
+ return;
+ }
+
+ cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
+}
+
+void sched_domains_free_llc_id(int cpu)
+{
+ sched_domains_mutex_lock();
+ __sched_domains_free_llc_id(cpu);
+ sched_domains_mutex_unlock();
+}
+
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@@ -2599,7 +2647,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
* have an LLC.
*/
if (!tl_llc) {
- per_cpu(sd_llc_id, i) = tl_max_llcs++;
+ per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
continue;
}
@@ -2620,7 +2668,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* a new LLC is detected */
if (lid == -1)
- per_cpu(sd_llc_id, i) = tl_max_llcs++;
+ per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
}
}
@@ -2798,6 +2846,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
{
int err;
+ zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
---
It doesn't compact tl_max_llcs, but it should promote reuse of llc_id if
all CPUs of an LLC go offline. I know it is a ridiculous scenario, but it
is possible nonetheless.
I'll let Peter and Valentin be the judge of additional space and
complexity needed for these bits :-)
> + if (i == j)
> + continue;
> +
> + lid = per_cpu(sd_llc_id, j);
> +
> + if (lid != -1) {
> + per_cpu(sd_llc_id, i) = lid;
> +
> + break;
> + }
> + }
> +
> + /* a new LLC is detected */
> + if (lid == -1)
> + per_cpu(sd_llc_id, i) = tl_max_llcs++;
> + }
> }
>
> if (WARN_ON(!topology_span_sane(cpu_map)))
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-16 7:44 ` K Prateek Nayak
@ 2026-02-17 6:07 ` Chen, Yu C
2026-02-17 8:09 ` K Prateek Nayak
2026-02-19 15:40 ` Peter Zijlstra
1 sibling, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-17 6:07 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Tim Chen, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
Hi Prateek,
On 2/16/2026 3:44 PM, K Prateek Nayak wrote:
> Hello Tim, Chenyu,
>
> On 2/11/2026 3:48 AM, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> Introduce an index mapping between CPUs and their LLCs. This provides
>> a continuous per LLC index needed for cache-aware load balancing in
>> later patches.
>>
>> The existing per_cpu llc_id usually points to the first CPU of the
>> LLC domain, which is sparse and unsuitable as an array index. Using
>> llc_id directly would waste memory.
>>
>> With the new mapping, CPUs in the same LLC share a continuous id:
>>
>> per_cpu(llc_id, CPU=0...15) = 0
>> per_cpu(llc_id, CPU=16...31) = 1
>> per_cpu(llc_id, CPU=32...47) = 2
>> ...
>>
>> Once a CPU has been assigned an llc_id, this ID persists even when
>> the CPU is taken offline and brought back online, which can facilitate
>> the management of the ID.
>>
>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> ---
>>
>> Notes:
>> v2->v3:
>> Allocate the LLC id according to the topology level data directly, rather
>> than calculating from the sched domain. This simplifies the code.
>> (Peter Zijlstra, K Prateek Nayak)
>>
>> kernel/sched/topology.c | 47 ++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 44 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index cf643a5ddedd..ca46b5cf7f78 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
>> /* Protected by sched_domains_mutex: */
>> static cpumask_var_t sched_domains_tmpmask;
>> static cpumask_var_t sched_domains_tmpmask2;
>> +static int tl_max_llcs;
>>
>> static int __init sched_debug_setup(char *str)
>> {
>> @@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
>> */
>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>> DEFINE_PER_CPU(int, sd_llc_size);
>> -DEFINE_PER_CPU(int, sd_llc_id);
>> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
>> DEFINE_PER_CPU(int, sd_share_id);
>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>> @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
>>
>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>> per_cpu(sd_llc_size, cpu) = size;
>> - per_cpu(sd_llc_id, cpu) = id;
>> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>
>> sd = lowest_flag_domain(cpu, SD_CLUSTER);
>> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>
>> /* Set up domains for CPUs specified by the cpu_map: */
>> for_each_cpu(i, cpu_map) {
>> - struct sched_domain_topology_level *tl;
>> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
>> + int lid;
>>
>> sd = NULL;
>> for_each_sd_topology(tl) {
>> + int flags = 0;
>> +
>> + if (tl->sd_flags)
>> + flags = (*tl->sd_flags)();
>> +
>> + if (flags & SD_SHARE_LLC)
>> + tl_llc = tl;
>
> nit. This loop breaks out when sched_domain_span(sd) covers the entire
> cpu_map and it might have not reached the topmost SD_SHARE_LLC domain
> yet. Is that cause for any concern?
>
Could you please elaborate a little more on this? If it covers the entire
cpu_map, shouldn't it stop going up to its parent domain? Do you mean that
sd_llc_1 and its parent sd_llc_2 could cover the same cpu_map, and we
should let tl_llc be assigned to sd_llc_2 (with sd_llc_1 degenerated)?
>>
>> sd = build_sched_domain(tl, cpu_map, attr, sd, i);
>>
>> @@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> if (cpumask_equal(cpu_map, sched_domain_span(sd)))
>> break;
>> }
>> +
>> + lid = per_cpu(sd_llc_id, i);
>> + if (lid == -1) {
>> + int j;
>> +
>> + /*
>> + * Assign the llc_id to the CPUs that do not
>> + * have an LLC.
>> + */
>> + if (!tl_llc) {
>> + per_cpu(sd_llc_id, i) = tl_max_llcs++;
>> +
>> + continue;
>> + }
>> +
>> + /* try to reuse the llc_id of its siblings */
>> + for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
>
>
> My only large concern that remains is the fact that offline CPUs are
> taken out of the tl->mask(), which can lead to interesting cases where
> CPUs in the same LLC can have different llc_id:
>
> o Boot with maxcpus=1
>
> o Run:
>
> for i in $(seq 1 "$NRCPUS"); do
> echo 1 > /sys/devices/system/cpu/cpu$i/online;
> echo 0 > /sys/devices/system/cpu/cpu$i/online;
> done
>
> o Finally run:
>
> echo 1 | tee /sys/devices/system/cpu/cpu*/online;
>
>
> Once all CPUs are online, only the CPUs in the boot CPU's LLC will have
> the same llc_id. Every other CPU will have a unique llc_id, which might
> make the system behave unexpectedly.
>
You are right, I did not realize that tl->mask() excludes offline CPUs and
would therefore be unreliable here, and this case is brilliant for exposing
the bug in the current code, nice catch!
> I'm wondering if we can do something like below on top of this patch:
>
> (Only build tested; Prepared on top of this patch in Tim's tree)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c6efa71cf500..aee1be89ab4c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8268,6 +8268,8 @@ static void cpuset_cpu_active(void)
> static void cpuset_cpu_inactive(unsigned int cpu)
> {
> if (!cpuhp_tasks_frozen) {
> + /* XXX: Is this the right spot? */
> + sched_domains_free_llc_id(cpu);
> cpuset_update_active_cpus();
> } else {
> num_cpus_frozen++;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index de5b701c3950..31a8910297c7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3903,6 +3903,7 @@ static inline bool sched_cache_enabled(void)
> }
> #endif
> extern void init_sched_mm(struct task_struct *p);
> +void sched_domains_free_llc_id(int cpu);
>
> extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ca46b5cf7f78..04c1ab489ee2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -18,6 +18,7 @@ void sched_domains_mutex_unlock(void)
> }
>
> /* Protected by sched_domains_mutex: */
> +static cpumask_var_t sched_domains_llc_id_allocmask;
> static cpumask_var_t sched_domains_tmpmask;
> static cpumask_var_t sched_domains_tmpmask2;
> static int tl_max_llcs;
> @@ -2543,6 +2544,53 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> return true;
> }
>
> +static int __sched_domains_alloc_llc_id(void)
> +{
> + int lid;
> +
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
> + if (lid >= tl_max_llcs)
> + tl_max_llcs++;
> +
> + /*
> + * llc_id space should never grow larger than the
> + * possible number of CPUs in the system.
> + */
> + if (!WARN_ON_ONCE(lid >= nr_cpumask_bits))
> + cpumask_set_cpu(lid, sched_domains_llc_id_allocmask);
> + return lid;
> +}
> +
> +static void __sched_domains_free_llc_id(int cpu)
> +{
> + int i, lid;
> +
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + lid = per_cpu(sd_llc_id, cpu);
> + if (lid == -1)
> + return;
> +
> + per_cpu(sd_llc_id, cpu) = -1;
> +
> + for_each_online_cpu(i) {
> + /* An online CPU owns the llc_id. */
> + if (per_cpu(sd_llc_id, i) == lid)
> + return;
> + }
> +
> + cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
> +}
> +
> +void sched_domains_free_llc_id(int cpu)
> +{
> + sched_domains_mutex_lock();
> + __sched_domains_free_llc_id(cpu);
> + sched_domains_mutex_unlock();
> +}
> +
> /*
> * Build sched domains for a given set of CPUs and attach the sched domains
> * to the individual CPUs
> @@ -2599,7 +2647,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> * have an LLC.
> */
> if (!tl_llc) {
> - per_cpu(sd_llc_id, i) = tl_max_llcs++;
> + per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
>
> continue;
> }
> @@ -2620,7 +2668,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>
> /* a new LLC is detected */
> if (lid == -1)
> - per_cpu(sd_llc_id, i) = tl_max_llcs++;
> + per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
> }
> }
>
> @@ -2798,6 +2846,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
> {
> int err;
>
> + zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
> zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
> zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
> zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
> ---
>
> It doesn't compact tl_max_llcs, but it should promote reuse of llc_id if
> all CPUs of an LLC go offline. I know it is a ridiculous scenario, but it
> is possible nonetheless.
>
> I'll let Peter and Valentin be the judge of additional space and
> complexity needed for these bits :-)
>
Smart approach! Dynamically reallocating the llc_id should be feasible,
as it releases the llc_id when the last CPU of that LLC is offlined. My
only concern is data synchronization issues arising from the reuse of
llc_id during load balancing; I'll audit the logic to check for any race
conditions. Alternatively, what if we introduce a tl->static_mask? It
would be similar to tl->mask, but CPUs would not be removed from
static_mask when they are offlined. This way, we can always find and reuse
the llc_id of CPUs in that LLC (even if all CPUs in the LLC have been
offlined at some point, provided they were once online), and we would thus
maintain a static llc_id. Anyway, let me do some testing on your proposal
as well as the static_mask idea, and I'll reply to this thread later.
Thanks for the insights!
thanks,
Chenyu
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-17 6:07 ` Chen, Yu C
@ 2026-02-17 8:09 ` K Prateek Nayak
2026-02-17 23:12 ` Tim Chen
` (2 more replies)
0 siblings, 3 replies; 117+ messages in thread
From: K Prateek Nayak @ 2026-02-17 8:09 UTC (permalink / raw)
To: Chen, Yu C
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Tim Chen, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
Hello Chenyu,
On 2/17/2026 11:37 AM, Chen, Yu C wrote:
> Hi Prateek,
>
> On 2/16/2026 3:44 PM, K Prateek Nayak wrote:
>> Hello Tim, Chenyu,
>>
>> On 2/11/2026 3:48 AM, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@intel.com>
>>>
>>> Introduce an index mapping between CPUs and their LLCs. This provides
>>> a continuous per LLC index needed for cache-aware load balancing in
>>> later patches.
>>>
>>> The existing per_cpu llc_id usually points to the first CPU of the
>>> LLC domain, which is sparse and unsuitable as an array index. Using
>>> llc_id directly would waste memory.
>>>
>>> With the new mapping, CPUs in the same LLC share a continuous id:
>>>
>>> per_cpu(llc_id, CPU=0...15) = 0
>>> per_cpu(llc_id, CPU=16...31) = 1
>>> per_cpu(llc_id, CPU=32...47) = 2
>>> ...
>>>
>>> Once a CPU has been assigned an llc_id, this ID persists even when
>>> the CPU is taken offline and brought back online, which can facilitate
>>> the management of the ID.
>>>
>>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> ---
>>>
>>> Notes:
>>> v2->v3:
>>> Allocate the LLC id according to the topology level data directly, rather
>>> than calculating from the sched domain. This simplifies the code.
>>> (Peter Zijlstra, K Prateek Nayak)
>>>
>>> kernel/sched/topology.c | 47 ++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 44 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index cf643a5ddedd..ca46b5cf7f78 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
>>> /* Protected by sched_domains_mutex: */
>>> static cpumask_var_t sched_domains_tmpmask;
>>> static cpumask_var_t sched_domains_tmpmask2;
>>> +static int tl_max_llcs;
>>> static int __init sched_debug_setup(char *str)
>>> {
>>> @@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
>>> */
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DEFINE_PER_CPU(int, sd_llc_size);
>>> -DEFINE_PER_CPU(int, sd_llc_id);
>>> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
>>> DEFINE_PER_CPU(int, sd_share_id);
>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>> per_cpu(sd_llc_size, cpu) = size;
>>> - per_cpu(sd_llc_id, cpu) = id;
>>> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>> sd = lowest_flag_domain(cpu, SD_CLUSTER);
>>> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>> /* Set up domains for CPUs specified by the cpu_map: */
>>> for_each_cpu(i, cpu_map) {
>>> - struct sched_domain_topology_level *tl;
>>> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
>>> + int lid;
>>> sd = NULL;
>>> for_each_sd_topology(tl) {
>>> + int flags = 0;
>>> +
>>> + if (tl->sd_flags)
>>> + flags = (*tl->sd_flags)();
>>> +
>>> + if (flags & SD_SHARE_LLC)
>>> + tl_llc = tl;
>>
>> nit. This loop breaks out when sched_domain_span(sd) covers the entire
>> cpu_map and it might have not reached the topmost SD_SHARE_LLC domain
>> yet. Is that cause for any concern?
>>
>
> Could you please elaborate a little more on this? If it covers the
> entire cpu_map, shouldn't it stop going up to its parent domain?
> Do you mean that sd_llc_1 and its parent sd_llc_2 could cover the same cpu_map,
> and we should let tl_llc be assigned to sd_llc_2 (sd_llc_1 being degenerated)?
I'm not sure if this is technically possible but assume following
topology:
[ LLC: 8-15 ]
[ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
and the following series of events:
o All CPUs in LLC are offline to begin with (maxcpus = 1 like scenario).
o CPUs 10-15 are onlined first.
o CPU8 is put in a separate root partition and brought online.
(XXX: I'm not 100% sure if this is possible in this order)
o build_sched_domains() will bail out at SMT domain since the cpumap
is covered by tl->mask() and tl_llc = tl_smt.
o llc_id calculation uses the tl_smt->mask() which will not contain
CPUs 10-15 and CPU8 will get a unique LLC id even though there are
other online CPUs in the LLC with a different llc_id (!!!)
Instead, if we traversed to tl_mc, we would have seen all the online
CPUs in the MC and reused the llc_id from them. Might not be an issue on
its own but if this root partition is removed later, CPU8 will continue
to have the unique llc_id even after merging into the same MC domain.
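The failure mode above boils down to which cpumask the id-reuse search walks. A standalone sketch (hypothetical helpers; only the names tl_max_llcs and llc_id echo the patch, and the masks are plain bitmaps rather than cpumask_t) of the reuse rule, including the narrow-mask pitfall:

```c
#include <assert.h>

#define NR_CPUS 16

static int llc_id[NR_CPUS];	/* like per_cpu(sd_llc_id), default -1 */
static int tl_max_llcs;		/* monotonically growing id pool */

static void reset_ids(void)
{
	for (int i = 0; i < NR_CPUS; i++)
		llc_id[i] = -1;
	tl_max_llcs = 0;
}

/*
 * Bring @cpu online: reuse the llc_id of any already-assigned CPU that
 * (a) shares @cpu's physical LLC (llc_of[]) and (b) lies inside the mask
 * the topology level exposes (@search_mask, a CPU bitmap). If the mask
 * is too narrow (e.g. an SMT span instead of the MC span), siblings are
 * missed and a spurious fresh id is handed out.
 */
static int cpu_online_assign(int cpu, const int *llc_of, unsigned int search_mask)
{
	for (int i = 0; i < NR_CPUS; i++) {
		if (((search_mask >> i) & 1) && llc_of[i] == llc_of[cpu] &&
		    llc_id[i] >= 0)
			return llc_id[cpu] = llc_id[i];
	}
	return llc_id[cpu] = tl_max_llcs++;
}
```

With CPUs 8-15 in one LLC, onlining CPUs 10-15 through the full LLC span and then CPU 8 through only its SMT span {8,9} reproduces the split-id outcome described above, while searching the wider MC span reuses the existing id.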
[..snip..]
>>
>> It doesn't compact tl_max_llcs, but it should promote reuse of llc_id if
>> all CPUs of a LLC go offline. I know it is a ridiculous scenario but it
>> is possible nonetheless.
>>
>> I'll let Peter and Valentin be the judge of additional space and
>> complexity needed for these bits :-)
>>
>
> Smart approach! Dynamically reallocating the llc_id should be feasible,
> as it releases the llc_id when the last CPU of that LLC is offlined. My
> only concern is data synchronization issues arising from the reuse of
> llc_id during load balancing - I’ll audit the logic to check for any race
> conditions. Alternatively, what if we introduce a tl->static_mask? It would
> be similar to tl->mask, but would not remove CPUs from static_mask when they
> are offlined. This way, we can always find and reuse the llc_id of CPUs in
> that LLC (even if all CPUs in the LLC have been offlined at some point,
> provided they were once online), and we would thus maintain a static llc_id.
That is possible, but it would require a larger arch-wide audit to add
support for it. It might be less complex to handle in the generic layer, but
again I'll let Peter and Valentin comment on this part :-)
>
> Anyway, let me do some testing of your proposal as well as the static_mask
> idea, and I'll reply to this thread later. Thanks for the insights!
Thanks a ton! Much appreciated.
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-15 14:25 ` Chen, Yu C
@ 2026-02-17 10:05 ` Madadi Vineeth Reddy
2026-02-17 21:20 ` Tim Chen
0 siblings, 1 reply; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-17 10:05 UTC (permalink / raw)
To: Chen, Yu C
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel, Tim Chen, Madadi Vineeth Reddy
On 15/02/26 19:55, Chen, Yu C wrote:
> On 2/15/2026 1:53 AM, Madadi Vineeth Reddy wrote:
>> On 11/02/26 03:48, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@intel.com>
>>>
>>> Introduce an index mapping between CPUs and their LLCs. This provides
>>> a continuous per LLC index needed for cache-aware load balancing in
>>> later patches.
>>>
>>> The existing per_cpu llc_id usually points to the first CPU of the
>>> LLC domain, which is sparse and unsuitable as an array index. Using
>>> llc_id directly would waste memory.
>>>
>>> With the new mapping, CPUs in the same LLC share a continuous id:
>>>
>>> per_cpu(llc_id, CPU=0...15) = 0
>>> per_cpu(llc_id, CPU=16...31) = 1
>>> per_cpu(llc_id, CPU=32...47) = 2
>>> ...
>>>
>>> Once a CPU has been assigned an llc_id, this ID persists even when
>>> the CPU is taken offline and brought back online, which can facilitate
>>> the management of the ID.
>>
>> tl_max_llcs is never reset across multiple invocations of build_sched_domains().
>> While this preserves LLC IDs across normal CPU hotplug events, I'm wondering about
>> scenarios where hardware topology changes, such as physically removing/replacing
>> CPU sockets.
>>
>> Example scenario:
>> Boot with 3 LLCs: IDs {0,1,2}, tl_max_llcs=3
>> Physical hardware change removes LLC 1
>> New hardware added at a different position gets ID=3
>> After multiple such events: System has 4 LLCs but IDs {0,2,5,7}, tl_max_llcs=8
>>
>
> I agree that keeping tl_max_llcs non-decreasing might waste some space. The
> original motivation for introducing a dynamic sd_llc_id was mainly that a
> static sd_llc_id[NR_LLC] is not suitable, as we cannot find a proper upper
> limit for NR_LLC-an arbitrary value for NR_LLC is unacceptable. That is to
> say, tl_max_llcs serves as the historical maximum LLC index that has ever
> been detected - like other terms such as CPU id. It is possible that the
> number of available LLCs shrinks due to CPU offline after boot-up. A value
> of tl_max_llcs=8 indicates that this system once had 8 valid LLCs. On the
> other hand, dense mapping is a side effect of dynamically allocating sd_llc_id.
>
>> This creates gaps in the ID space. However, I understand this trade-off might be
>> intentional since physical topology changes are rare, and resetting tl_max_llcs and
>> all sd_llc_id values would rebuild IDs on every invocation of build_sched_domains().
>>
>> Would like to know your thoughts on overhead of resetting tl_max_llcs and sd_llc_id
>> so that IDs are rebuilt on each invocation of build_sched_domains() to always maintain
>> a dense mapping.
>>
>
> The current implementation is intentionally kept simple for easier review, and
> I agree that strictly enforcing a dense mapping for sd_llc_id - by recalculating
> the actual maximum LLC count (max_llcs) whenever the CPU topology changes - could
> be an optimization direction once the basic version has been accepted. I assume what
> you are suggesting is that we could reset tl_max_llcs/max_llcs/sd_llc_id for CPUs
> in doms_new[i] within partition_sched_domains_locked() - and then rebuild these
> values in build_sched_domains() accordingly. One risk here is a race condition when
> modifying the llc_id of a specific CPU - but off the top of my head, valid_llc_buf()
> should help prevent out-of-range access to sd->pf caused by such races.
> Thoughts?
Yes, resetting and rebuilding would maintain dense mapping. Given the added complexity
of race conditions vs. minimal benefit (gaps only occur with physical topology changes),
I think the current approach is better. We can revisit it once this version goes through.
Thanks,
Vineeth
>
> thanks,
> Chenyu
>
* Re: [PATCH v3 11/21] sched/cache: Prioritize tasks preferring destination LLC during balancing
2026-02-10 22:18 ` [PATCH v3 11/21] sched/cache: Prioritize tasks preferring destination LLC during balancing Tim Chen
@ 2026-02-17 18:33 ` Madadi Vineeth Reddy
2026-02-17 21:45 ` Tim Chen
0 siblings, 1 reply; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-17 18:33 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu,
Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef,
Libo Chen, linux-kernel, Madadi Vineeth Reddy
On 11/02/26 03:48, Tim Chen wrote:
> During LLC load balancing, first check for tasks that prefer the
> destination LLC and balance them to it before others.
>
> Mark source sched groups containing tasks preferring non local LLCs
> with the group_llc_balance flag. This ensures the load balancer later
> pulls or pushes these tasks toward their preferred LLCs.
>
> The load balancer selects the busiest sched_group and migrates tasks
> to less busy groups to distribute load across CPUs.
>
> With cache-aware scheduling enabled, the busiest sched_group is
> the one with most tasks preferring the destination LLC. If
> the group has the llc_balance flag set, cache aware load balancing is
> triggered.
>
> Introduce the helper function update_llc_busiest() to identify the
> sched_group with the most tasks preferring the destination LLC.
>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>
> Notes:
> v2->v3:
> Consider sd->nr_balance_failed when deciding whether
> LLC load balance should be used.
> (Peter Zijlstra)
>
> kernel/sched/fair.c | 77 ++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 76 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b0cf4424d198..43dcf2827298 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9649,6 +9649,11 @@ enum group_type {
> * from balancing the load across the system.
> */
> group_imbalanced,
> + /*
> + * There are tasks running on non-preferred LLC, possible to move
> + * them to their preferred LLC without creating too much imbalance.
> + */
> + group_llc_balance,
> /*
> * The CPU is overloaded and can't provide expected CPU cycles to all
> * tasks.
> @@ -10561,6 +10566,7 @@ struct sg_lb_stats {
> enum group_type group_type;
> unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> unsigned int group_smt_balance; /* Task on busy SMT be moved */
> + unsigned int group_llc_balance; /* Tasks should be moved to preferred LLC */
> unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> #ifdef CONFIG_NUMA_BALANCING
> unsigned int nr_numa_running;
> @@ -10819,6 +10825,9 @@ group_type group_classify(unsigned int imbalance_pct,
> if (group_is_overloaded(imbalance_pct, sgs))
> return group_overloaded;
>
> + if (sgs->group_llc_balance)
> + return group_llc_balance;
> +
group_llc_balance is placed before group_imbalanced. In cases where a group is both imbalanced and
contains tasks preferring the destination LLC, LLC balancing will be selected first.
I assume the reasoning is that migrating tasks toward their preferred LLC may also help reduce
imbalance, and in cases where the goals conflict, the nr_balance_failed / cache_nice_tries
logic will eventually fall back to regular load balancing. Is that the intended policy?
It might be helpful to briefly mention this reasoning in the changelog, since this ordering
changes balancing priority.
Thanks,
Vineeth
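The ordering question can be seen in a condensed, standalone model of group_classify() (only the three group types under discussion; the stats struct here is a simplified stand-in for the real sg_lb_stats, not the kernel code):

```c
#include <assert.h>

/* Enum order mirrors the patch: group_llc_balance sits between
 * group_imbalanced and group_overloaded. */
enum group_type {
	group_has_spare,
	group_imbalanced,
	group_llc_balance,
	group_overloaded,
};

struct sg_stats {
	int overloaded;
	int llc_balance;
	int imbalanced;
};

/*
 * The check order, not the enum order, decides which type a group gets:
 * overloaded wins over everything, llc_balance wins over imbalanced.
 */
static enum group_type classify(const struct sg_stats *s)
{
	if (s->overloaded)
		return group_overloaded;
	if (s->llc_balance)
		return group_llc_balance;
	if (s->imbalanced)
		return group_imbalanced;
	return group_has_spare;
}
```

So a group that is both imbalanced and has llc_balance set is classified group_llc_balance, which is the ordering the comment above asks about.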
> if (sg_imbalanced(group))
> return group_imbalanced;
>
> @@ -11012,11 +11021,66 @@ static void record_sg_llc_stats(struct lb_env *env,
> if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
> WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
> }
> +
> +/*
> + * Do LLC balance on sched group that contains LLC, and have tasks preferring
> + * to run on LLC in idle dst_cpu.
> + */
> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> + struct sched_group *group)
> +{
> + if (!sched_cache_enabled())
> + return false;
> +
> + if (env->sd->flags & SD_SHARE_LLC)
> + return false;
> +
> + /*
> + * Don't do cache aware balancing if there
> + * are too many balance failures.
> + *
> + * Should fall back to regular load balancing
> + * after repeated cache aware balance failures.
> + */
> + if (env->sd->nr_balance_failed >=
> + env->sd->cache_nice_tries + 1)
> + return false;
> +
> + if (sgs->nr_pref_dst_llc &&
> + can_migrate_llc(cpumask_first(sched_group_span(group)),
> + env->dst_cpu, 0, true) == mig_llc)
> + return true;
> +
> + return false;
> +}
> +
> +static bool update_llc_busiest(struct lb_env *env,
> + struct sg_lb_stats *busiest,
> + struct sg_lb_stats *sgs)
> +{
> + /*
> + * There are more tasks that want to run on dst_cpu's LLC.
> + */
> + return sgs->nr_pref_dst_llc > busiest->nr_pref_dst_llc;
> +}
> #else
> static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
> struct sched_group *group)
> {
> }
> +
> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> + struct sched_group *group)
> +{
> + return false;
> +}
> +
> +static bool update_llc_busiest(struct lb_env *env,
> + struct sg_lb_stats *busiest,
> + struct sg_lb_stats *sgs)
> +{
> + return false;
> +}
> #endif
>
> /**
> @@ -11118,6 +11182,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> /* Check for loaded SMT group to be balanced to dst CPU */
> if (smt_balance(env, sgs, group))
> sgs->group_smt_balance = 1;
> +
> + /* Check for tasks in this group can be moved to their preferred LLC */
> + if (llc_balance(env, sgs, group))
> + sgs->group_llc_balance = 1;
> }
>
> sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
> @@ -11181,6 +11249,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> /* Select the overloaded group with highest avg_load. */
> return sgs->avg_load > busiest->avg_load;
>
> + case group_llc_balance:
> + /* Select the group with most tasks preferring dst LLC */
> + return update_llc_busiest(env, busiest, sgs);
> +
> case group_imbalanced:
> /*
> * Select the 1st imbalanced group as we don't have any way to
> @@ -11443,6 +11515,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
> return false;
> break;
>
> + case group_llc_balance:
> case group_imbalanced:
> case group_asym_packing:
> case group_smt_balance:
> @@ -11575,6 +11648,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
> return NULL;
> break;
>
> + case group_llc_balance:
> case group_imbalanced:
> case group_asym_packing:
> case group_smt_balance:
> @@ -12074,7 +12148,8 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
> * group's child domain.
> */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> - sibling_imbalance(env, &sds, busiest, local) > 1)
> + (busiest->group_type == group_llc_balance ||
> + sibling_imbalance(env, &sds, busiest, local) > 1))
> goto force_balance;
>
> if (busiest->group_type != group_overloaded) {
* Re: [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing
2026-02-14 15:34 ` Chen, Yu C
@ 2026-02-17 18:51 ` Tim Chen
0 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-17 18:51 UTC (permalink / raw)
To: Chen, Yu C, Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel
On Sat, 2026-02-14 at 23:34 +0800, Chen, Yu C wrote:
> Hi Vineeth,
>
> On 2/14/2026 8:26 PM, Madadi Vineeth Reddy wrote:
> > Hi Tim,
> > Thanks for the patch series.
> >
> > On 11/02/26 03:48, Tim Chen wrote:
> > > From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> > >
> > > Adds infrastructure to enable cache-aware load balancing,
> > > which improves cache locality by grouping tasks that share resources
> > > within the same cache domain. This reduces cache misses and improves
> > > overall data access efficiency.
> >
> > [..snip..]
> >
> > > +void mm_init_sched(struct mm_struct *mm,
> > > + struct sched_cache_time __percpu *_pcpu_sched)
> > > +{
> > > + unsigned long epoch;
> > > + int i;
> > > +
> > > + for_each_possible_cpu(i) {
> > > + struct sched_cache_time *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
> > > + struct rq *rq = cpu_rq(i);
> > > +
> > > + pcpu_sched->runtime = 0;
> > > + pcpu_sched->epoch = rq->cpu_epoch;
> > > + epoch = rq->cpu_epoch;
> >
> > Shouldn't cpu_epoch be read under cpu_epoch_lock, similar to how fraction_mm_sched()
> > and __update_mm_sched() acquire the lock before accessing this field?
>
> My understanding is that __update_mm_sched() updates rq->cpu_epoch in
> two steps: first, it reads the current value, and then it writes the
> new value back to it (as seen in the operation rq->cpu_epoch += n).
> For this reason, a lock is required to prevent race conditions during
> concurrent updates across multiple CPUs.
>
> In contrast, reading rq->cpu_epoch in mm_init_sched() is a single
> atomic operation, and it is acceptable to read a stale value in this
> scenario - thus, we can safely perform an unprotected read of this
> field here.
>
There is no particular advantage in preventing cpu_epoch updates during mm
initialization by holding the cpu_epoch lock. The difference from using
a slightly stale cpu_epoch will be at most one epoch in __update_mm_sched(),
which will be quickly aged out in subsequent updates.
However, holding the lock could affect scalability and slow down workloads
that fork short-lived processes frequently, hence the choice.
Tim
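The locked-RMW vs. lock-free-read distinction being discussed can be modeled with a small pthread toy (plain userspace code, not the kernel implementation; the function names only echo the ones above):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t epoch_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long cpu_epoch;

/* __update_mm_sched() analogue: a read-modify-write, must hold the lock
 * or concurrent increments can be lost. */
static void update_epoch(unsigned long n)
{
	pthread_mutex_lock(&epoch_lock);
	cpu_epoch += n;
	pthread_mutex_unlock(&epoch_lock);
}

/*
 * mm_init_sched() analogue: a single aligned load. Skipping the lock
 * means the snapshot may be one epoch stale, which later updates age
 * out -- the scalability trade-off described above for fork-heavy
 * workloads.
 */
static unsigned long sample_epoch(void)
{
	return cpu_epoch;
}

static void *bump(void *arg)
{
	for (int i = 0; i < 100000; i++)
		update_epoch(1);
	return arg;
}
```

Four threads running bump() finish with exactly 400000 increments because the RMW is locked; the unlocked reader only ever risks seeing a slightly older value, never corrupting one.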
* Re: [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC
2026-02-10 22:18 ` [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
@ 2026-02-17 19:00 ` Madadi Vineeth Reddy
2026-02-17 22:04 ` Tim Chen
2026-02-20 13:53 ` Peter Zijlstra
1 sibling, 1 reply; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-17 19:00 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu,
Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef,
Libo Chen, linux-kernel, Madadi Vineeth Reddy
On 11/02/26 03:48, Tim Chen wrote:
> In generic load balancing (non-cache-aware load balancing),
> if the busiest runqueue has only one task, active balancing may be
> invoked to move it. However, this migration might break LLC locality.
>
> Before migration, check whether the task is running on its preferred
> LLC: do not move a lone task to another LLC if it would move the task
> away from its preferred LLC or cause excessive imbalance between LLCs.
>
> On the other hand, if the migration type is migrate_llc_task, it means
> that there are tasks on env->src_cpu that want to be migrated to
> their preferred LLC, so launch active load balancing anyway.
Nit:
But the check of migrate_llc_task is made after checking alb_break_llc,
which seems contradictory. I understand that the check
env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable
prevents alb_break_llc from returning true when migrate_llc_task exists.
However, checking migrate_llc_task first would make the priority and
intent more explicit.
Thanks,
Vineeth
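The interaction between the two checks can be sketched in a stripped-down model (it omits the util / can_migrate_llc() branch, and the env struct is a toy whose fields merely borrow the patch's names) showing why the nr_pref guard keeps them from conflicting:

```c
#include <assert.h>
#include <stdbool.h>

enum { MIGRATE_OTHER, MIGRATE_LLC_TASK };

struct toy_env {
	int migration_type;
	int nr_pref_llc_running;	/* src tasks preferring their current LLC */
	int h_nr_runnable;		/* runnable tasks on the src rq */
	bool shares_cache;		/* cpus_share_cache(src_cpu, dst_cpu) */
};

/* Only the "lone task already on its preferred LLC" branch of
 * alb_break_llc(): it requires ALL runnable tasks to prefer the
 * current LLC. */
static bool alb_break_llc(const struct toy_env *e)
{
	if (e->shares_cache)
		return false;
	return e->nr_pref_llc_running &&
	       e->nr_pref_llc_running == e->h_nr_runnable &&
	       e->h_nr_runnable <= 1;
}

static int need_active_balance(const struct toy_env *e)
{
	if (alb_break_llc(e))
		return 0;		/* don't yank a lone, happy task */
	if (e->migration_type == MIGRATE_LLC_TASK)
		return 1;
	return 0;
}
```

A task that wants the destination LLC is not counted in nr_pref_llc_running, so whenever migrate_llc_task applies the guard's "all tasks prefer here" condition is false and the veto cannot fire, even though it is checked first.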
>
> Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>
> Notes:
> v2->v3:
> Remove redundant rcu read lock in break_llc_locality().
>
> kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1697791ef11c..03959a701514 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9999,12 +9999,60 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
> task_util(p), to_pref);
> }
>
> +/*
> + * Check if active load balance breaks LLC locality in
> + * terms of cache aware load balance.
> + */
> +static inline bool
> +alb_break_llc(struct lb_env *env)
> +{
> + if (!sched_cache_enabled())
> + return false;
> +
> + if (cpus_share_cache(env->src_cpu, env->dst_cpu))
> + return false;
> + /*
> + * All tasks prefer to stay on their current CPU.
> + * Do not pull a task from its preferred CPU if:
> + * 1. It is the only task running there; OR
> + * 2. Migrating it away from its preferred LLC would violate
> + * the cache-aware scheduling policy.
> + */
> + if (env->src_rq->nr_pref_llc_running &&
> + env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
> + unsigned long util = 0;
> + struct task_struct *cur;
> +
> + if (env->src_rq->nr_running <= 1)
> + return true;
> +
> + /*
> + * Reach here in load balance with
> + * rcu_read_lock() protected.
> + */
> + cur = rcu_dereference(env->src_rq->curr);
> + if (cur)
> + util = task_util(cur);
> +
> + if (can_migrate_llc(env->src_cpu, env->dst_cpu,
> + util, false) == mig_forbid)
> + return true;
> + }
> +
> + return false;
> +}
> #else
> static inline bool get_llc_stats(int cpu, unsigned long *util,
> unsigned long *cap)
> {
> return false;
> }
> +
> +static inline bool
> +alb_break_llc(struct lb_env *env)
> +{
> + return false;
> +}
> #endif
> /*
> * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> @@ -12421,6 +12469,9 @@ static int need_active_balance(struct lb_env *env)
> {
> struct sched_domain *sd = env->sd;
>
> + if (alb_break_llc(env))
> + return 0;
> +
> if (asym_active_balance(env))
> return 1;
>
> @@ -12440,7 +12491,8 @@ static int need_active_balance(struct lb_env *env)
> return 1;
> }
>
> - if (env->migration_type == migrate_misfit)
> + if (env->migration_type == migrate_misfit ||
> + env->migration_type == migrate_llc_task)
> return 1;
>
> return 0;
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-17 10:05 ` Madadi Vineeth Reddy
@ 2026-02-17 21:20 ` Tim Chen
0 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-17 21:20 UTC (permalink / raw)
To: Madadi Vineeth Reddy, Chen, Yu C
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel
On Tue, 2026-02-17 at 15:35 +0530, Madadi Vineeth Reddy wrote:
> On 15/02/26 19:55, Chen, Yu C wrote:
> > On 2/15/2026 1:53 AM, Madadi Vineeth Reddy wrote:
> > > On 11/02/26 03:48, Tim Chen wrote:
> > > > From: Chen Yu <yu.c.chen@intel.com>
> > > >
> > > > Introduce an index mapping between CPUs and their LLCs. This provides
> > > > a continuous per LLC index needed for cache-aware load balancing in
> > > > later patches.
> > > >
> > > > The existing per_cpu llc_id usually points to the first CPU of the
> > > > LLC domain, which is sparse and unsuitable as an array index. Using
> > > > llc_id directly would waste memory.
> > > >
> > > > With the new mapping, CPUs in the same LLC share a continuous id:
> > > >
> > > > per_cpu(llc_id, CPU=0...15) = 0
> > > > per_cpu(llc_id, CPU=16...31) = 1
> > > > per_cpu(llc_id, CPU=32...47) = 2
> > > > ...
> > > >
> > > > Once a CPU has been assigned an llc_id, this ID persists even when
> > > > the CPU is taken offline and brought back online, which can facilitate
> > > > the management of the ID.
> > >
> > > tl_max_llcs is never reset across multiple invocations of build_sched_domains().
> > > While this preserves LLC IDs across normal CPU hotplug events, I'm wondering about
> > > scenarios where hardware topology changes, such as physically removing/replacing
> > > CPU sockets.
> > >
> > > Example scenario:
> > > Boot with 3 LLCs: IDs {0,1,2}, tl_max_llcs=3
> > > Physical hardware change removes LLC 1
> > > New hardware added at a different position gets ID=3
> > > After multiple such events: System has 4 LLCs but IDs {0,2,5,7}, tl_max_llcs=8
> > >
> >
> > I agree that keeping tl_max_llcs non-decreasing might waste some space. The
> > original motivation for introducing a dynamic sd_llc_id was mainly that a
> > static sd_llc_id[NR_LLC] is not suitable, as we cannot find a proper upper
> > limit for NR_LLC - an arbitrary value for NR_LLC is unacceptable. That is to
> > say, tl_max_llcs serves as the historical maximum LLC index that has ever
> > been detected - like other terms such as CPU id. It is possible that the
> > number of available LLCs shrinks due to CPU offline after boot-up. A value
> > of tl_max_llcs=8 indicates that this system once had 8 valid LLCs. On the
> > other hand, dense mapping is a side effect of dynamically allocating sd_llc_id.
> >
> > > This creates gaps in the ID space. However, I understand this trade-off might be
> > > intentional since physical topology changes are rare, and resetting tl_max_llcs and
> > > all sd_llc_id values would rebuild IDs on every invocation of build_sched_domains().
> > >
> > > Would like to know your thoughts on overhead of resetting tl_max_llcs and sd_llc_id
> > > so that IDs are rebuilt on each invocation of build_sched_domains() to always maintain
> > > a dense mapping.
> > >
> >
> > The current implementation is intentionally kept simple for easier review, and
> > I agree that strictly enforcing a dense mapping for sd_llc_id - by recalculating
> > the actual maximum LLC count (max_llcs) whenever the CPU topology changes - could
> > be an optimization direction once the basic version has been accepted. I assume what
> > you are suggesting is that we could reset tl_max_llcs/max_llcs/sd_llc_id for CPUs
> > in doms_new[i] within partition_sched_domains_locked() - and then rebuild these
> > values in build_sched_domains() accordingly. One risk here is a race condition when
> > modifying the llc_id of a specific CPU - but off the top of my head, valid_llc_buf()
> > should help prevent out-of-range access to sd->pf caused by such races.
> > Thoughts?
>
> Yes, resetting and rebuilding would maintain dense mapping. Given the added complexity
> of race conditions vs. minimal benefit (gaps only occur with physical topology changes),
> I think the current approach is better. We can revisit it once this version goes through.
>
The current implementation keeps the LLC id unchanged across sched domain
rebuilds. The idea was to allow pf[id] to be kept across rebuilds and keep
pointing to the same LLC.
That said, now that we clear pf[id] across sched domain rebuilds, this
constraint can be relaxed, and it should be okay to change the LLC id from
the perspective of cache-aware scheduling.
However, there could be some transient races with cpus_share_cache() while
the LLC id is being changed, which the current implementation avoids.
Tim
* Re: [PATCH v3 11/21] sched/cache: Prioritize tasks preferring destination LLC during balancing
2026-02-17 18:33 ` Madadi Vineeth Reddy
@ 2026-02-17 21:45 ` Tim Chen
0 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-17 21:45 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu,
Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef,
Libo Chen, linux-kernel
On Wed, 2026-02-18 at 00:03 +0530, Madadi Vineeth Reddy wrote:
> On 11/02/26 03:48, Tim Chen wrote:
> > During LLC load balancing, first check for tasks that prefer the
> > destination LLC and balance them to it before others.
> >
> > Mark source sched groups containing tasks preferring non local LLCs
> > with the group_llc_balance flag. This ensures the load balancer later
> > pulls or pushes these tasks toward their preferred LLCs.
> >
> > The load balancer selects the busiest sched_group and migrates tasks
> > to less busy groups to distribute load across CPUs.
> >
> > With cache-aware scheduling enabled, the busiest sched_group is
> > the one with most tasks preferring the destination LLC. If
> > the group has the llc_balance flag set, cache aware load balancing is
> > triggered.
> >
> > Introduce the helper function update_llc_busiest() to identify the
> > sched_group with the most tasks preferring the destination LLC.
> >
> > Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >
> > Notes:
> > v2->v3:
> > Consider sd->nr_balance_failed when deciding whether
> > LLC load balance should be used.
> > (Peter Zijlstra)
> >
> > kernel/sched/fair.c | 77 ++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 76 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b0cf4424d198..43dcf2827298 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9649,6 +9649,11 @@ enum group_type {
> > * from balancing the load across the system.
> > */
> > group_imbalanced,
> > + /*
> > + * There are tasks running on non-preferred LLC, possible to move
> > + * them to their preferred LLC without creating too much imbalance.
> > + */
> > + group_llc_balance,
> > /*
> > * The CPU is overloaded and can't provide expected CPU cycles to all
> > * tasks.
> > @@ -10561,6 +10566,7 @@ struct sg_lb_stats {
> > enum group_type group_type;
> > unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
> > unsigned int group_smt_balance; /* Task on busy SMT be moved */
> > + unsigned int group_llc_balance; /* Tasks should be moved to preferred LLC */
> > unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
> > #ifdef CONFIG_NUMA_BALANCING
> > unsigned int nr_numa_running;
> > @@ -10819,6 +10825,9 @@ group_type group_classify(unsigned int imbalance_pct,
> > if (group_is_overloaded(imbalance_pct, sgs))
> > return group_overloaded;
> >
> > + if (sgs->group_llc_balance)
> > + return group_llc_balance;
> > +
>
> group_llc_balance is placed before group_imbalanced. In cases where a group is both imbalanced and
> contains tasks preferring the destination LLC, LLC balancing will be selected first.
>
> I assume the reasoning is that migrating tasks toward their preferred LLC may also help reduce
> imbalance, and in cases where the goals conflict, the nr_balance_failed / cache_nice_tries
> logic will eventually fall back to regular load balancing. Is that the intended policy?
>
> It might be helpful to briefly mention this reasoning in the changelog, since this ordering
> changes balancing priority.
>
group_llc_balance naturally aggregates tasks into an LLC and can create imbalance
between LLC domains.
If we handled group_imbalanced first, then after we balanced the load
and moved on to consider group_llc_balance,
group_llc_balance would re-create load imbalance between the LLCs
and undo all the previous load balancing work.
It is better to handle group_llc_balance first, moving tasks to their preferred
LLC, and then let group_imbalanced adjust any remaining imbalance in load.
The can_migrate_llc_task() check will prevent group_imbalanced from undoing
the work done previously under group_llc_balance.
Yes, we'll add some comments to explain the reasoning behind the load balance priority.
Tim
* Re: [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC
2026-02-17 19:00 ` Madadi Vineeth Reddy
@ 2026-02-17 22:04 ` Tim Chen
0 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-17 22:04 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu,
Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef,
Libo Chen, linux-kernel
On Wed, 2026-02-18 at 00:30 +0530, Madadi Vineeth Reddy wrote:
> On 11/02/26 03:48, Tim Chen wrote:
> > In the generic load balance(non-cache-aware-load-balance),
> > if the busiest runqueue has only one task, active balancing may be
> > invoked to move it. However, this migration might break LLC locality.
> >
> > Before migration, check whether the task is running on its preferred
> > LLC: Do not move a lone task to another LLC if it would move the task
> > away from its preferred LLC or cause excessive imbalance between LLCs.
> >
> > On the other hand, if the migration type is migrate_llc_task, it means
> > that there are tasks on the env->src_cpu that want to be migrated to
> > their preferred LLC, launch the active load balance anyway.
>
> Nit:
> But the check of migrate_llc_task is made after checking alb_break_llc
> which seems to be contradicting. I understand that this check
>
> env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable
>
> prevents alb_break_llc to return true when migrate_llc_task exists. However,
> checking migrate_llc_task first would make the priority and intent more
> explicit.
We have actually considered that.
Suppose we did the migrate_llc_task check first: we would still have to
verify that the migration does not push the load imbalance beyond
what we allow with can_migrate_llc(), i.e. perform the same
check as in alb_break_llc().
Then, for the other kinds of task migration, we would also need to check
that those migrations don't break the LLC policy with alb_break_llc().
So it is better to just do the alb_break_llc() check first to
cover all migration types.
Moving the migrate_llc_task check up wouldn't really save
any code.
Tim
>
> Thanks,
> Vineeth
>
> >
> > Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >
> > Notes:
> > v2->v3:
> > Remove redundant rcu read lock in break_llc_locality().
> >
> > kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 53 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1697791ef11c..03959a701514 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9999,12 +9999,60 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
> > task_util(p), to_pref);
> > }
> >
> > +/*
> > + * Check if active load balance breaks LLC locality in
> > + * terms of cache aware load balance.
> > + */
> > +static inline bool
> > +alb_break_llc(struct lb_env *env)
> > +{
> > + if (!sched_cache_enabled())
> > + return false;
> > +
> > + if (cpus_share_cache(env->src_cpu, env->dst_cpu))
> > + return false;
> > + /*
> > + * All tasks prefer to stay on their current CPU.
> > + * Do not pull a task from its preferred CPU if:
> > + * 1. It is the only task running there; OR
> > + * 2. Migrating it away from its preferred LLC would violate
> > + * the cache-aware scheduling policy.
> > + */
> > + if (env->src_rq->nr_pref_llc_running &&
> > + env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
> > + unsigned long util = 0;
> > + struct task_struct *cur;
> > +
> > + if (env->src_rq->nr_running <= 1)
> > + return true;
> > +
> > + /*
> > + * Reach here in load balance with
> > + * rcu_read_lock() protected.
> > + */
> > + cur = rcu_dereference(env->src_rq->curr);
> > + if (cur)
> > + util = task_util(cur);
> > +
> > + if (can_migrate_llc(env->src_cpu, env->dst_cpu,
> > + util, false) == mig_forbid)
> > + return true;
> > + }
> > +
> > + return false;
> > +}
> > #else
> > static inline bool get_llc_stats(int cpu, unsigned long *util,
> > unsigned long *cap)
> > {
> > return false;
> > }
> > +
> > +static inline bool
> > +alb_break_llc(struct lb_env *env)
> > +{
> > + return false;
> > +}
> > #endif
> > /*
> > * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> > @@ -12421,6 +12469,9 @@ static int need_active_balance(struct lb_env *env)
> > {
> > struct sched_domain *sd = env->sd;
> >
> > + if (alb_break_llc(env))
> > + return 0;
> > +
> > if (asym_active_balance(env))
> > return 1;
> >
> > @@ -12440,7 +12491,8 @@ static int need_active_balance(struct lb_env *env)
> > return 1;
> > }
> >
> > - if (env->migration_type == migrate_misfit)
> > + if (env->migration_type == migrate_misfit ||
> > + env->migration_type == migrate_llc_task)
> > return 1;
> >
> > return 0;
>
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-17 8:09 ` K Prateek Nayak
@ 2026-02-17 23:12 ` Tim Chen
2026-02-18 3:28 ` K Prateek Nayak
2026-02-18 15:11 ` Chen, Yu C
2026-02-19 15:48 ` Peter Zijlstra
2 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-17 23:12 UTC (permalink / raw)
To: K Prateek Nayak, Chen, Yu C
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On Tue, 2026-02-17 at 13:39 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
>
[...snip...]
> > > > */
> > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > > > DEFINE_PER_CPU(int, sd_llc_size);
> > > > -DEFINE_PER_CPU(int, sd_llc_id);
> > > > +DEFINE_PER_CPU(int, sd_llc_id) = -1;
> > > > DEFINE_PER_CPU(int, sd_share_id);
> > > > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > > > @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
> > > > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > > > per_cpu(sd_llc_size, cpu) = size;
> > > > - per_cpu(sd_llc_id, cpu) = id;
> > > > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> > > > sd = lowest_flag_domain(cpu, SD_CLUSTER);
> > > > @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > > > /* Set up domains for CPUs specified by the cpu_map: */
> > > > for_each_cpu(i, cpu_map) {
> > > > - struct sched_domain_topology_level *tl;
> > > > + struct sched_domain_topology_level *tl, *tl_llc = NULL;
> > > > + int lid;
> > > > sd = NULL;
> > > > for_each_sd_topology(tl) {
> > > > + int flags = 0;
> > > > +
> > > > + if (tl->sd_flags)
> > > > + flags = (*tl->sd_flags)();
> > > > +
> > > > + if (flags & SD_SHARE_LLC)
> > > > + tl_llc = tl;
> > >
> > > nit. This loop breaks out when sched_domain_span(sd) covers the entire
> > > cpu_map and it might have not reached the topmost SD_SHARE_LLC domain
> > > yet. Is that cause for any concern?
> > >
> >
> > Could you please elaborate a little more on this? If it covers the
> > entire cpu_map shouldn't it stop going up to its parent domain?
> > Do you mean, sd_llc_1 and its parent sd_llc_2 could cover the same cpu_map,
> > and we should let tl_llc to assigned to sd_llc_2 (sd_llc_1 be degenerated? )
>
> I'm not sure if this is technically possible but assume following
> topology:
>
> [ LLC: 8-15 ]
> [ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
>
> and the following series of events:
>
> o All CPUs in LLC are offline to begin with (maxcpus = 1 like scenario).
>
> o CPUs 10-15 are onlined first.
>
> o CPU8 is put in a separate root partition and brought online.
> (XXX: I'm not 100% sure if this is possible in this order)
>
> o build_sched_domains() will bail out at SMT domain since the cpumap
> is covered by tl->mask() and tl_llc = tl_smt.
>
> o llc_id calculation uses the tl_smt->mask() which will not contain
> CPUs 10-15 and CPU8 will get a unique LLC id even though there are
> other online CPUs in the LLC with a different llc_id (!!!)
>
>
> Instead, if we traversed to tl_mc, we would have seen all the online
> CPUs in the MC and reused the llc_id from them. Might not be an issue on
> its own but if this root partition is removed later, CPU8 will continue
> to have the unique llc_id even after merging into the same MC domain.
There is really no reason to reuse the llc_id as far as cache aware scheduling
goes in its v3 revision (see my reply to Madadi on this patch).
I am thinking that simply rebuilding the LLC ids across sched domain
rebuilds is probably the cleanest solution. There could be some races
in cpus_share_cache() as llc_id gets reassigned for some CPUs when they
come online/offline, but we already have similar races in current mainline code.
The worst it can do is some temporary sub-optimal task placement.
Thoughts?
Tim
>
> [..snip..]
>
> > >
> > > It doesn't compact tl_max_llcs, but it should promote reuse of llc_id if
> > > all CPUs of a LLC go offline. I know it is a ridiculous scenario but it
> > > is possible nonetheless.
> > >
> > > I'll let Peter and Valentin be the judge of additional space and
> > > complexity needed for these bits :-)
> > >
> >
> > Smart approach! Dynamically reallocating the llc_id should be feasible,
> > as it releases the llc_id when the last CPU of that LLC is offlined. My
> > only concern is data synchronization issues arising from the reuse of
> > llc_id during load balancing - I’ll audit the logic to check for any race
> > conditions. Alternatively, what if we introduce a tl->static_mask? It would
> > be similar to tl->mask, but would not remove CPUs from static_mask when they
> > are offlined. This way, we can always find and reuse the llc_id of CPUs in
> > that LLC (even if all CPUs in the LLC have been offlined at some point,
> > provided they were once online), and we would thus maintain a static llc_id.
>
> That is possible but it would require a larger arch/ wide audit to add
> support for. Might be less complex to handle in the generic layer but
> again I'll let Peter and Valentin comment on this part :-)
>
> >
> > Anyway, let do some testings on your proposal as well as static_mask things,
> > and I'll reply to this thread later. Thanks for the insights!
>
> Thanks a ton! Much appreciated.
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-17 23:12 ` Tim Chen
@ 2026-02-18 3:28 ` K Prateek Nayak
2026-02-18 15:22 ` Chen, Yu C
2026-02-18 21:33 ` Tim Chen
0 siblings, 2 replies; 117+ messages in thread
From: K Prateek Nayak @ 2026-02-18 3:28 UTC (permalink / raw)
To: Tim Chen, Chen, Yu C
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
Hello Tim,
On 2/18/2026 4:42 AM, Tim Chen wrote:
> On Tue, 2026-02-17 at 13:39 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>>
>
> [...snip...]
>
>
>>>>> */
>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>>> DEFINE_PER_CPU(int, sd_llc_size);
>>>>> -DEFINE_PER_CPU(int, sd_llc_id);
>>>>> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
>>>>> DEFINE_PER_CPU(int, sd_share_id);
>>>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>>> @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
>>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>> per_cpu(sd_llc_size, cpu) = size;
>>>>> - per_cpu(sd_llc_id, cpu) = id;
>>>>> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>>>> sd = lowest_flag_domain(cpu, SD_CLUSTER);
>>>>> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>>>> /* Set up domains for CPUs specified by the cpu_map: */
>>>>> for_each_cpu(i, cpu_map) {
>>>>> - struct sched_domain_topology_level *tl;
>>>>> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
>>>>> + int lid;
>>>>> sd = NULL;
>>>>> for_each_sd_topology(tl) {
>>>>> + int flags = 0;
>>>>> +
>>>>> + if (tl->sd_flags)
>>>>> + flags = (*tl->sd_flags)();
>>>>> +
>>>>> + if (flags & SD_SHARE_LLC)
>>>>> + tl_llc = tl;
>>>>
>>>> nit. This loop breaks out when sched_domain_span(sd) covers the entire
>>>> cpu_map and it might have not reached the topmost SD_SHARE_LLC domain
>>>> yet. Is that cause for any concern?
>>>>
>>>
>>> Could you please elaborate a little more on this? If it covers the
>>> entire cpu_map shouldn't it stop going up to its parent domain?
>>> Do you mean, sd_llc_1 and its parent sd_llc_2 could cover the same cpu_map,
>>> and we should let tl_llc to assigned to sd_llc_2 (sd_llc_1 be degenerated? )
>>
>> I'm not sure if this is technically possible but assume following
>> topology:
>>
>> [ LLC: 8-15 ]
>> [ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
>>
>> and the following series of events:
>>
>> o All CPUs in LLC are offline to begin with (maxcpus = 1 like scenario).
>>
>> o CPUs 10-15 are onlined first.
>>
>> o CPU8 is put in a separate root partition and brought online.
>> (XXX: I'm not 100% sure if this is possible in this order)
>>
>> o build_sched_domains() will bail out at SMT domain since the cpumap
>> is covered by tl->mask() and tl_llc = tl_smt.
>>
>> o llc_id calculation uses the tl_smt->mask() which will not contain
>> CPUs 10-15 and CPU8 will get a unique LLC id even though there are
>> other online CPUs in the LLC with a different llc_id (!!!)
>>
>>
>> Instead, if we traversed to tl_mc, we would have seen all the online
>> CPUs in the MC and reused the llc_id from them. Might not be an issue on
>> its own but if this root partition is removed later, CPU8 will continue
>> to have the unique llc_id even after merging into the same MC domain.
>
> There is really no reason to reuse the llc_id as far as cache aware scheduling
> goes in its v3 revision (see my reply to Madadi on this patch).
Even I don't mind having some holes in the llc_id space when CPUs are
offlined, but my major concern would be seeing an inconsistent state
where CPUs in the same MC domain end up with different llc_ids after
a bunch of hotplug activity.
>
> I am thinking that if we just simply rebuild LLC id across sched domain
> rebuilds, that is probably the cleanest solution. There could be some races
> in cpus_share_cache() as llc_id gets reassigned for some CPUs when they
> come online/offline. But we also having similar races in current mainline code.
> Worst it can do is some temporary sub-optimal scheduling task placement.
>
> Thoughts?
If you are suggesting populating the sd_llc_id for all the CPUs on
topology rebuild, I'm not entirely against the idea.
On a separate note, if we add a dependency on SCHED_MC for SCHED_CACHE,
we can simply look at cpu_coregroup_mask() and either allocate a new
llc_id / borrow llc id in sched_cpu_activate() when CPU is onlined or
reassign them in sched_cpu_deactivate() if an entire LLC is offlined.
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 14/21] sched/cache: Respect LLC preference in task migration and detach
2026-02-10 22:18 ` [PATCH v3 14/21] sched/cache: Respect LLC preference in task migration and detach Tim Chen
@ 2026-02-18 9:14 ` Madadi Vineeth Reddy
2026-02-18 15:34 ` Chen, Yu C
0 siblings, 1 reply; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-18 9:14 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu,
Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef,
Libo Chen, linux-kernel, Madadi Vineeth Reddy
On 11/02/26 03:48, Tim Chen wrote:
> During the final step of load balancing, can_migrate_task() now
> considers a task's LLC preference before moving it out of its
> preferred LLC.
>
> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>
> Notes:
> v2->v3:
> Use the similar mechanism as NUMA balancing, which skips over
> the tasks that would degrade locality in can_migrate_task();
> and only if nr_balanced_failed is high enough do we ignore that.
> (Peter Zijlstra)
>
> Let migrate_degrade_locality() take precedence over
> migrate_degrades_llc(), which aims to migrate towards the preferred
> NUMA node. (Peter Zijlstra)
>
> kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 13 +++++++++
> 2 files changed, 73 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 03959a701514..d1145997b88d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9973,8 +9973,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
> * Check if task p can migrate from source LLC to
> * destination LLC in terms of cache aware load balance.
> */
> -static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
> - struct task_struct *p)
> +static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
> + struct task_struct *p)
> {
> struct mm_struct *mm;
> bool to_pref;
> @@ -10041,6 +10041,47 @@ alb_break_llc(struct lb_env *env)
>
> return false;
> }
> +
> +/*
> + * Check if migrating task p from env->src_cpu to
> > + * env->dst_cpu breaks LLC locality.
> + */
> +static bool migrate_degrades_llc(struct task_struct *p, struct lb_env *env)
> +{
> + if (!sched_cache_enabled())
> + return false;
> +
> + if (task_has_sched_core(p))
> + return false;
> + /*
> + * Skip over tasks that would degrade LLC locality;
> + * only when nr_balanced_failed is sufficiently high do we
> + * ignore this constraint.
> + *
> + * Threshold of cache_nice_tries is set to 1 higher
> + * than nr_balance_failed to avoid excessive task
> + * migration at the same time. Refer to comments around
> + * llc_balance().
> + */
> + if (env->sd->nr_balance_failed >= env->sd->cache_nice_tries + 1)
> + return false;
> +
> + /*
> + * We know the env->src_cpu has some tasks prefer to
> + * run on env->dst_cpu, skip the tasks do not prefer
> + * env->dst_cpu, and find the one that prefers.
> + */
> + if (env->migration_type == migrate_llc_task &&
> + task_llc(p) != llc_id(env->dst_cpu))
> + return true;
`task_llc(p)` returns the LLC id of the CPU the task is currently running on, right?
Wouldn’t we need to check the task’s *preferred* LLC instead?
Am I missing something?
Thanks,
Vineeth
> +
> + if (can_migrate_llc_task(env->src_cpu,
> + env->dst_cpu, p) != mig_forbid)
> + return false;
> +
> + return true;
> +}
> +
> #else
> static inline bool get_llc_stats(int cpu, unsigned long *util,
> unsigned long *cap)
> @@ -10053,6 +10094,12 @@ alb_break_llc(struct lb_env *env)
> {
> return false;
> }
> +
> +static inline bool
> +migrate_degrades_llc(struct task_struct *p, struct lb_env *env)
> +{
> + return false;
> +}
> #endif
> /*
> * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
> @@ -10150,10 +10197,19 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> return 1;
>
> degrades = migrate_degrades_locality(p, env);
> - if (!degrades)
> + if (!degrades) {
> + /*
> + * If the NUMA locality is not broken,
> + * further check if migration would hurt
> + * LLC locality.
> + */
> + if (migrate_degrades_llc(p, env))
> + return 0;
> +
> hot = task_hot(p, env);
> - else
> + } else {
> hot = degrades > 0;
> + }
>
> if (!hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
> if (hot)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ac8c7ac1ac0d..c18e59f320a6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1495,6 +1495,14 @@ extern void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags);
> extern void sched_core_get(void);
> extern void sched_core_put(void);
>
> +static inline bool task_has_sched_core(struct task_struct *p)
> +{
> + if (sched_core_disabled())
> + return false;
> +
> + return !!p->core_cookie;
> +}
> +
> #else /* !CONFIG_SCHED_CORE: */
>
> static inline bool sched_core_enabled(struct rq *rq)
> @@ -1534,6 +1542,11 @@ static inline bool sched_group_cookie_match(struct rq *rq,
> return true;
> }
>
> +static inline bool task_has_sched_core(struct task_struct *p)
> +{
> + return false;
> +}
> +
> #endif /* !CONFIG_SCHED_CORE */
>
> #ifdef CONFIG_RT_GROUP_SCHED
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-17 8:09 ` K Prateek Nayak
2026-02-17 23:12 ` Tim Chen
@ 2026-02-18 15:11 ` Chen, Yu C
2026-02-19 15:48 ` Peter Zijlstra
2 siblings, 0 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-18 15:11 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Tim Chen, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On 2/17/2026 4:09 PM, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 2/17/2026 11:37 AM, Chen, Yu C wrote:
>> Hi Prateek,
>>
>> On 2/16/2026 3:44 PM, K Prateek Nayak wrote:
>>> Hello Tim, Chenyu,
>>>
>>> On 2/11/2026 3:48 AM, Tim Chen wrote:
>>>> From: Chen Yu <yu.c.chen@intel.com>
[...]
>>>> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>>> /* Set up domains for CPUs specified by the cpu_map: */
>>>> for_each_cpu(i, cpu_map) {
>>>> - struct sched_domain_topology_level *tl;
>>>> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
>>>> + int lid;
>>>> sd = NULL;
>>>> for_each_sd_topology(tl) {
>>>> + int flags = 0;
>>>> +
>>>> + if (tl->sd_flags)
>>>> + flags = (*tl->sd_flags)();
>>>> +
>>>> + if (flags & SD_SHARE_LLC)
>>>> + tl_llc = tl;
>>>
>>> nit. This loop breaks out when sched_domain_span(sd) covers the entire
>>> cpu_map and it might have not reached the topmost SD_SHARE_LLC domain
>>> yet. Is that cause for any concern?
>>>
>>
>> Could you please elaborate a little more on this? If it covers the
>> entire cpu_map shouldn't it stop going up to its parent domain?
>> Do you mean, sd_llc_1 and its parent sd_llc_2 could cover the same cpu_map,
>> and we should let tl_llc to assigned to sd_llc_2 (sd_llc_1 be degenerated? )
>
> I'm not sure if this is technically possible but assume following
> topology:
>
> [ LLC: 8-15 ]
> [ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
>
> and the following series of events:
>
> o All CPUs in LLC are offline to begin with (maxcpus = 1 like scenario).
>
> o CPUs 10-15 are onlined first.
>
> o CPU8 is put in a separate root partition and brought online.
> (XXX: I'm not 100% sure if this is possible in this order)
>
It can happen. I gave it a try on a VM, and there were different
llc_ids within one LLC even after CPU8 had been taken out of the separate
partition.
> o build_sched_domains() will bail out at SMT domain since the cpumap
> is covered by tl->mask() and tl_llc = tl_smt.
>
> o llc_id calculation uses the tl_smt->mask() which will not contain
> CPUs 10-15 and CPU8 will get a unique LLC id even though there are
> other online CPUs in the LLC with a different llc_id (!!!)
>
>
Fair enough, thanks for the explanation in detail.
thanks,
Chenyu
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-18 3:28 ` K Prateek Nayak
@ 2026-02-18 15:22 ` Chen, Yu C
2026-02-18 17:46 ` K Prateek Nayak
2026-02-18 18:45 ` Tim Chen
2026-02-18 21:33 ` Tim Chen
1 sibling, 2 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-18 15:22 UTC (permalink / raw)
To: K Prateek Nayak, Tim Chen
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On 2/18/2026 11:28 AM, K Prateek Nayak wrote:
> Hello Tim,
>
> On 2/18/2026 4:42 AM, Tim Chen wrote:
>> On Tue, 2026-02-17 at 13:39 +0530, K Prateek Nayak wrote:
>>> Hello Chenyu,
>>>
>>>
>>
>> [...snip...]
>>
>>
>>>>>> */
>>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>>>>> DEFINE_PER_CPU(int, sd_llc_size);
>>>>>> -DEFINE_PER_CPU(int, sd_llc_id);
>>>>>> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
>>>>>> DEFINE_PER_CPU(int, sd_share_id);
>>>>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>>>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>>>>> @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
>>>>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>>>>> per_cpu(sd_llc_size, cpu) = size;
>>>>>> - per_cpu(sd_llc_id, cpu) = id;
>>>>>> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>>>>> sd = lowest_flag_domain(cpu, SD_CLUSTER);
>>>>>> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>>>>> /* Set up domains for CPUs specified by the cpu_map: */
>>>>>> for_each_cpu(i, cpu_map) {
>>>>>> - struct sched_domain_topology_level *tl;
>>>>>> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
>>>>>> + int lid;
>>>>>> sd = NULL;
>>>>>> for_each_sd_topology(tl) {
>>>>>> + int flags = 0;
>>>>>> +
>>>>>> + if (tl->sd_flags)
>>>>>> + flags = (*tl->sd_flags)();
>>>>>> +
>>>>>> + if (flags & SD_SHARE_LLC)
>>>>>> + tl_llc = tl;
>>>>>
>>>>> nit. This loop breaks out when sched_domain_span(sd) covers the entire
>>>>> cpu_map and it might have not reached the topmost SD_SHARE_LLC domain
>>>>> yet. Is that cause for any concern?
>>>>>
>>>>
>>>> Could you please elaborate a little more on this? If it covers the
>>>> entire cpu_map shouldn't it stop going up to its parent domain?
>>>> Do you mean, sd_llc_1 and its parent sd_llc_2 could cover the same cpu_map,
>>>> and we should let tl_llc to assigned to sd_llc_2 (sd_llc_1 be degenerated? )
>>>
>>> I'm not sure if this is technically possible but assume following
>>> topology:
>>>
>>> [ LLC: 8-15 ]
>>> [ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
>>>
>>> and the following series of events:
>>>
>>> o All CPUs in LLC are offline to begin with (maxcpus = 1 like scenario).
>>>
>>> o CPUs 10-15 are onlined first.
>>>
>>> o CPU8 is put in a separate root partition and brought online.
>>> (XXX: I'm not 100% sure if this is possible in this order)
>>>
>>> o build_sched_domains() will bail out at SMT domain since the cpumap
>>> is covered by tl->mask() and tl_llc = tl_smt.
>>>
>>> o llc_id calculation uses the tl_smt->mask() which will not contain
>>> CPUs 10-15 and CPU8 will get a unique LLC id even though there are
>>> other online CPUs in the LLC with a different llc_id (!!!)
>>>
>>>
>>> Instead, if we traversed to tl_mc, we would have seen all the online
>>> CPUs in the MC and reused the llc_id from them. Might not be an issue on
>>> its own but if this root partition is removed later, CPU8 will continue
>>> to have the unique llc_id even after merging into the same MC domain.
>>
>> There is really no reason to reuse the llc_id as far as cache aware scheduling
>> goes in its v3 revision (see my reply to Madadi on this patch).
>
> Even I don't mind having some holes in the llc_id space when CPUs are
> offlined but my major concern would be seeing an inconsistent state
> where CPUs in same MC domains end up with different llc_id when after
> a bunch of hotplug activity.
>
>>
>> I am thinking that if we just simply rebuild LLC id across sched domain
>> rebuilds, that is probably the cleanest solution.
Tim, do you mean resetting all CPUs' LLC ids to -1 whenever there is a hotplug
event in partition_sched_domains_locked(), and rebuilding them from scratch
in build_sched_domains(), so that we refresh the LLC id for every
CPU? (I discussed this with Vineeth here:
https://lore.kernel.org/all/54e60704-b0f3-44df-9b83-070806b5a00c@intel.com/)
>> There could be some races
>> in cpus_share_cache() as llc_id gets reassigned for some CPUs when they
>> come online/offline. But we also have similar races in the current mainline code.
>> Worst it can do is some temporary sub-optimal scheduling task placement.
>>
>> Thoughts?
>
> If you are suggesting populating the sd_llc_id for all the CPUs on
> topology rebuild, I'm not entirely against the idea.
>
> On a separate note, if we add a dependency on SCHED_MC for SCHED_CACHE,
> we can simply look at cpu_coregroup_mask() and either allocate a new
> llc_id / borrow llc id in sched_cpu_activate() when CPU is onlined or
> reassign them in sched_cpu_deactivate() if an entire LLC is offlined.
>
Prateek, may I know if you are thinking of updating every CPU's LLC id
during its hotplug, rather than updating all per-CPU LLC ids in
build_sched_domains()?
thanks,
Chenyu
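The reset-and-rebuild idea Chenyu asks about can be sketched as a toy model (illustrative only: `llc_mask` stands in for the kernel's LLC cpumasks, and neither function name is a kernel API):

```python
def rebuild_llc_ids(online_cpus, llc_mask):
    """Wipe all per-CPU LLC ids and reassign them from scratch.

    llc_mask maps a CPU to the frozenset of CPUs sharing its LLC.
    Because everything is rebuilt on each call, hotplug can never
    leave a stale id behind, at the cost of ids changing across
    rebuilds (the race Tim mentions for cpus_share_cache()).
    """
    sd_llc_id = {cpu: -1 for cpu in online_cpus}   # reset, like per-CPU -1
    next_id = 0
    for cpu in sorted(online_cpus):
        if sd_llc_id[cpu] != -1:
            continue                               # LLC already labeled
        for peer in llc_mask[cpu] & set(online_cpus):
            sd_llc_id[peer] = next_id              # whole LLC gets one id
        next_id += 1
    return sd_llc_id, next_id
```

With one CPU of an LLC offline, the remaining CPUs of that LLC still share one id, and ids stay contiguous after every rebuild.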
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 14/21] sched/cache: Respect LLC preference in task migration and detach
2026-02-18 9:14 ` Madadi Vineeth Reddy
@ 2026-02-18 15:34 ` Chen, Yu C
0 siblings, 0 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-18 15:34 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel, Tim Chen
On 2/18/2026 5:14 PM, Madadi Vineeth Reddy wrote:
> On 11/02/26 03:48, Tim Chen wrote:
>> During the final step of load balancing, can_migrate_task() now
>> considers a task's LLC preference before moving it out of its
>> preferred LLC.
>>
>> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Co-developed-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>
>> Notes:
>> v2->v3:
>> Use a similar mechanism to NUMA balancing, which skips over
>> the tasks that would degrade locality in can_migrate_task();
>> and only if nr_balanced_failed is high enough do we ignore that.
>> (Peter Zijlstra)
>>
>> Let migrate_degrades_locality(), which aims to migrate towards the
>> preferred NUMA node, take precedence over migrate_degrades_llc().
>> (Peter Zijlstra)
>>
>> kernel/sched/fair.c | 64 +++++++++++++++++++++++++++++++++++++++++---
>> kernel/sched/sched.h | 13 +++++++++
>> 2 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 03959a701514..d1145997b88d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9973,8 +9973,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
>> * Check if task p can migrate from source LLC to
>> * destination LLC in terms of cache aware load balance.
>> */
>> -static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
>> - struct task_struct *p)
>> +static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
>> + struct task_struct *p)
>> {
>> struct mm_struct *mm;
>> bool to_pref;
>> @@ -10041,6 +10041,47 @@ alb_break_llc(struct lb_env *env)
>>
>> return false;
>> }
>> +
>> +/*
>> + * Check if migrating task p from env->src_cpu to
>> + * env->dst_cpu breaks LLC locality.
>> + */
>> +static bool migrate_degrades_llc(struct task_struct *p, struct lb_env *env)
>> +{
>> + if (!sched_cache_enabled())
>> + return false;
>> +
>> + if (task_has_sched_core(p))
>> + return false;
>> + /*
>> + * Skip over tasks that would degrade LLC locality;
>> + * only when nr_balanced_failed is sufficiently high do we
>> + * ignore this constraint.
>> + *
>> + * Threshold of cache_nice_tries is set to 1 higher
>> + * than nr_balance_failed to avoid excessive task
>> + * migration at the same time. Refer to comments around
>> + * llc_balance().
>> + */
>> + if (env->sd->nr_balance_failed >= env->sd->cache_nice_tries + 1)
>> + return false;
>> +
>> + /*
>> + * We know env->src_cpu has some tasks that prefer to
>> + * run on env->dst_cpu; skip the tasks that do not prefer
>> + * env->dst_cpu and find one that does.
>> + */
>> + if (env->migration_type == migrate_llc_task &&
>> + task_llc(p) != llc_id(env->dst_cpu))
>> + return true;
>
> `task_llc(p)` returns the LLC id of the CPU the task is currently running on, right?
> Wouldn’t we need to check the task’s *preferred* LLC instead?
>
> Am I missing something?
>
No, you did not miss anything; this is indeed a bug.
I realized this during our discussion yesterday:
https://lore.kernel.org/all/22e975fd-4e31-498d-a016-2168721f532a@intel.com/
I will fix it accordingly. Thanks!
thanks,
Chenyu
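The corrected check being agreed on above, comparing the task's preferred LLC rather than the LLC it currently runs on, can be modeled as a small sketch (the helper name and flat argument list are hypothetical, not the patch's code):

```python
def skip_for_llc_locality(task_pref_llc, dst_llc, migration_type,
                          nr_balance_failed, cache_nice_tries):
    """Return True if this task should be skipped during load balance
    because migrating it to dst_llc would degrade LLC locality."""
    # Ignore the constraint once balancing has failed often enough,
    # mirroring the cache_nice_tries + 1 threshold in the patch.
    if nr_balance_failed >= cache_nice_tries + 1:
        return False
    # The fix: compare the task's *preferred* LLC against the
    # destination LLC, not the LLC the task happens to run on.
    if migration_type == "migrate_llc_task" and task_pref_llc != dst_llc:
        return True
    return False
```

A task preferring LLC 2 is skipped when the destination is LLC 3, but migrates freely once the failure count crosses the threshold.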
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-18 15:22 ` Chen, Yu C
@ 2026-02-18 17:46 ` K Prateek Nayak
2026-02-18 23:21 ` Tim Chen
2026-02-18 18:45 ` Tim Chen
1 sibling, 1 reply; 117+ messages in thread
From: K Prateek Nayak @ 2026-02-18 17:46 UTC (permalink / raw)
To: Chen, Yu C, Tim Chen
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
Hello Chenyu,
On 2/18/2026 8:52 PM, Chen, Yu C wrote:
>> On a separate note, if we add a dependency on SCHED_MC for SCHED_CACHE,
>> we can simply look at cpu_coregroup_mask() and either allocate a new
>> llc_id / borrow llc id in sched_cpu_activate() when CPU is onlined or
>> reassign them in sched_cpu_deactivate() if an entire LLC is offlined.
>>
>
> Prateek, may I know if you are thinking of updating every CPU's LLC id
> during its hotplug and not update all percpu LLC id in build_sched_domains()?
I was still thinking of build_sched_domains() (or somewhere in the
online and offline path) where we can first simply look at
cpu_coregroup_mask() and decide if we need to traverse all CPUs and
shuffle the IDs.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-10 22:18 ` [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
@ 2026-02-18 17:54 ` Madadi Vineeth Reddy
2026-02-18 21:44 ` Tim Chen
` (2 more replies)
2026-02-19 16:50 ` Peter Zijlstra
1 sibling, 3 replies; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-18 17:54 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Madadi Vineeth Reddy
On 11/02/26 03:48, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
>
> A performance regression was observed by Prateek when running hackbench
> with many threads per process (high fd count). To avoid this, processes
> with a large number of active threads are excluded from cache-aware
> scheduling.
>
> With sched_cache enabled, record the number of active threads in each
> process during the periodic task_cache_work(). While iterating over
> CPUs, if the currently running task belongs to the same process as the
> task that launched task_cache_work(), increment the active thread count.
>
> If the number of active threads within the process exceeds the number
> of cores (CPUs divided by the number of SMT siblings) in the LLC, do not enable cache-aware
> scheduling. For users who wish to perform task aggregation regardless,
> a debugfs knob is provided for tuning in a subsequent patch.
>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>
> Notes:
> v2->v3:
> Put the calculating of nr_running_avg and the use of it into 1 patch.
> (Peter Zijlstra)
>
> Use guard(rcu)() when calculating the number of active threads of the
> process.
> (Peter Zijlstra)
>
> Introduce update_avg_scale() rather than using update_avg() to fit
> system with small LLC.
> (Aaron Lu)
>
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 57 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c98bd1c46088..511c9b263386 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2346,6 +2346,7 @@ struct sched_cache_stat {
> struct sched_cache_time __percpu *pcpu_sched;
> raw_spinlock_t lock;
> unsigned long epoch;
> + u64 nr_running_avg;
> int cpu;
> } ____cacheline_aligned_in_smp;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d1145997b88d..86b6b08e7e1e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
> return valid_llc_id(id);
> }
>
> +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
> +{
> + int smt_nr = 1;
> +
> +#ifdef CONFIG_SCHED_SMT
> + if (sched_smt_active())
> + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
> +#endif
> +
> + return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
> + per_cpu(sd_llc_size, cpu));
On Power10/Power11 with SMT4 and LLC size of 4, this check
effectively disables cache-aware scheduling for any process.
I raised this point in v1 as well. Increasing the threshold
doesn't seem like a viable solution either, as that would regress
hackbench/ebizzy.
Is there a way to make this useful for architectures with small LLC
sizes? One possible approach we were exploring is to have the LLC at a
hemisphere level that comprises multiple SMT4 cores.
Thanks,
Vineeth
> +}
> +
> static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
> {
> struct sched_domain *sd;
> @@ -1417,7 +1430,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> */
> if (time_after(epoch,
> READ_ONCE(mm->sc_stat.epoch) + EPOCH_LLC_AFFINITY_TIMEOUT) ||
> - get_nr_threads(p) <= 1) {
> + get_nr_threads(p) <= 1 ||
> + exceed_llc_nr(mm, cpu_of(rq))) {
> if (mm->sc_stat.cpu != -1)
> mm->sc_stat.cpu = -1;
> }
> @@ -1458,13 +1472,31 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
> }
> }
>
> +static inline void update_avg_scale(u64 *avg, u64 sample)
> +{
> + int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
> + s64 diff = sample - *avg;
> + u32 divisor;
> +
> + /*
> + * Scale the divisor based on the number of CPUs contained
> + * in the LLC. This scaling ensures smaller LLC domains use
> + * a smaller divisor to achieve more precise sensitivity to
> + * changes in nr_running, while larger LLC domains are capped
> + * at a maximum divisor of 8 which is the default smoothing
> + * factor of EWMA in update_avg().
> + */
> + divisor = clamp_t(u32, (factor >> 2), 2, 8);
> + *avg += div64_s64(diff, divisor);
> +}
> +
> static void task_cache_work(struct callback_head *work)
> {
> - struct task_struct *p = current;
> + struct task_struct *p = current, *cur;
> struct mm_struct *mm = p->mm;
> unsigned long m_a_occ = 0;
> unsigned long curr_m_a_occ = 0;
> - int cpu, m_a_cpu = -1;
> + int cpu, m_a_cpu = -1, nr_running = 0;
> cpumask_var_t cpus;
>
> WARN_ON_ONCE(work != &p->cache_work);
> @@ -1474,6 +1506,13 @@ static void task_cache_work(struct callback_head *work)
> if (p->flags & PF_EXITING)
> return;
>
> + if (get_nr_threads(p) <= 1) {
> + if (mm->sc_stat.cpu != -1)
> + mm->sc_stat.cpu = -1;
> +
> + return;
> + }
> +
> if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
> return;
>
> @@ -1497,6 +1536,12 @@ static void task_cache_work(struct callback_head *work)
> m_occ = occ;
> m_cpu = i;
> }
> + scoped_guard (rcu) {
> + cur = rcu_dereference(cpu_rq(i)->curr);
> + if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
> + cur->mm == mm)
> + nr_running++;
> + }
> }
>
> /*
> @@ -1540,6 +1585,7 @@ static void task_cache_work(struct callback_head *work)
> mm->sc_stat.cpu = m_a_cpu;
> }
>
> + update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
> free_cpumask_var(cpus);
> }
>
> @@ -9988,6 +10034,13 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
> if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
> return mig_unrestricted;
>
> + /* skip cache aware load balance for single/too many threads */
> + if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu)) {
> + if (mm->sc_stat.cpu != -1)
> + mm->sc_stat.cpu = -1;
> + return mig_unrestricted;
> + }
> +
> if (cpus_share_cache(dst_cpu, cpu))
> to_pref = true;
> else if (cpus_share_cache(src_cpu, cpu))
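The update_avg_scale() divisor scaling in the quoted patch can be modeled numerically (an illustrative sketch in integer arithmetic; note the kernel's div64_s64() truncates toward zero, unlike Python's floor division):

```python
def update_avg_scale(avg, sample, llc_size):
    """EWMA step whose smoothing factor shrinks with LLC size:
    clamp(llc_size >> 2, 2, 8), as in the quoted patch, so small
    LLCs react faster to changes in nr_running."""
    divisor = min(max(llc_size >> 2, 2), 8)
    diff = sample - avg
    return avg + int(diff / divisor)   # truncate toward zero, like div64_s64
```

For an LLC of 4 CPUs the divisor is 2 (fast tracking); for 32 or more CPUs it saturates at 8, the default EWMA factor of update_avg().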
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-18 15:22 ` Chen, Yu C
2026-02-18 17:46 ` K Prateek Nayak
@ 2026-02-18 18:45 ` Tim Chen
1 sibling, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-18 18:45 UTC (permalink / raw)
To: Chen, Yu C, K Prateek Nayak
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On Wed, 2026-02-18 at 23:22 +0800, Chen, Yu C wrote:
> On 2/18/2026 11:28 AM, K Prateek Nayak wrote:
> > Hello Tim,
> >
> > On 2/18/2026 4:42 AM, Tim Chen wrote:
> > > On Tue, 2026-02-17 at 13:39 +0530, K Prateek Nayak wrote:
> > > > Hello Chenyu,
> > > >
> > > >
> > >
> > > [...snip...]
> > >
> > >
> > > > > > > */
> > > > > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > > > > > > DEFINE_PER_CPU(int, sd_llc_size);
> > > > > > > -DEFINE_PER_CPU(int, sd_llc_id);
> > > > > > > +DEFINE_PER_CPU(int, sd_llc_id) = -1;
> > > > > > > DEFINE_PER_CPU(int, sd_share_id);
> > > > > > > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > > > > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > > > > > > @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
> > > > > > > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > > > > > > per_cpu(sd_llc_size, cpu) = size;
> > > > > > > - per_cpu(sd_llc_id, cpu) = id;
> > > > > > > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> > > > > > > sd = lowest_flag_domain(cpu, SD_CLUSTER);
> > > > > > > @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > > > > > > /* Set up domains for CPUs specified by the cpu_map: */
> > > > > > > for_each_cpu(i, cpu_map) {
> > > > > > > - struct sched_domain_topology_level *tl;
> > > > > > > + struct sched_domain_topology_level *tl, *tl_llc = NULL;
> > > > > > > + int lid;
> > > > > > > sd = NULL;
> > > > > > > for_each_sd_topology(tl) {
> > > > > > > + int flags = 0;
> > > > > > > +
> > > > > > > + if (tl->sd_flags)
> > > > > > > + flags = (*tl->sd_flags)();
> > > > > > > +
> > > > > > > + if (flags & SD_SHARE_LLC)
> > > > > > > + tl_llc = tl;
> > > > > >
> > > > > > nit. This loop breaks out when sched_domain_span(sd) covers the entire
> > > > > > cpu_map and it might have not reached the topmost SD_SHARE_LLC domain
> > > > > > yet. Is that cause for any concern?
> > > > > >
> > > > >
> > > > > Could you please elaborate a little more on this? If it covers the
> > > > > entire cpu_map shouldn't it stop going up to its parent domain?
> > > > > Do you mean, sd_llc_1 and its parent sd_llc_2 could cover the same cpu_map,
> > > > > and we should let tl_llc be assigned to sd_llc_2 (with sd_llc_1 degenerated)?
> > > >
> > > > I'm not sure if this is technically possible but assume following
> > > > topology:
> > > >
> > > > [ LLC: 8-15 ]
> > > > [ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
> > > >
> > > > and the following series of events:
> > > >
> > > > o All CPUs in LLC are offline to begin with (maxcpus = 1 like scenario).
> > > >
> > > > o CPUs 10-15 are onlined first.
> > > >
> > > > o CPU8 is put in a separate root partition and brought online.
> > > > (XXX: I'm not 100% sure if this is possible in this order)
> > > >
> > > > o build_sched_domains() will bail out at SMT domain since the cpumap
> > > > is covered by tl->mask() and tl_llc = tl_smt.
> > > >
> > > > o llc_id calculation uses the tl_smt->mask() which will not contain
> > > > CPUs 10-15 and CPU8 will get a unique LLC id even though there are
> > > > other online CPUs in the LLC with a different llc_id (!!!)
> > > >
> > > >
> > > > Instead, if we traversed to tl_mc, we would have seen all the online
> > > > CPUs in the MC and reused the llc_id from them. Might not be an issue on
> > > > its own but if this root partition is removed later, CPU8 will continue
> > > > to have the unique llc_id even after merging into the same MC domain.
> > >
> > > There is really no reason to reuse the llc_id as far as cache aware scheduling
> > > goes in its v3 revision (see my reply to Madadi on this patch).
> >
> > Even I don't mind having some holes in the llc_id space when CPUs are
> > offlined but my major concern would be seeing an inconsistent state
> > where CPUs in same MC domains end up with different llc_id when after
> > a bunch of hotplug activity.
> >
> > >
> > > I am thinking that if we just simply rebuild LLC id across sched domain
> > > rebuilds, that is probably the cleanest solution.
>
> Tim, do you mean reset all CPUs' LLC id to -1 whenever there is hotplug
> event in partition_sched_domains_locked(), and rebuild them from scratch
> in build_sched_domains(), so we already refresh the LLC id for every
> CPU(I discussed with Vineeth here:
> https://lore.kernel.org/all/54e60704-b0f3-44df-9b83-070806b5a00c@intel.com/)
Yes, that's what I was thinking. However, there could be some races in
cpus_share_cache() with this approach.
Tim
>
>
> > > There could be some races
> > > in cpus_share_cache() as llc_id gets reassigned for some CPUs when they
> > > come online/offline. But we also have similar races in the current mainline code.
> > > Worst it can do is some temporary sub-optimal scheduling task placement.
> > >
> > > Thoughts?
> >
> > If you are suggesting populating the sd_llc_id for all the CPUs on
> > topology rebuild, I'm not entirely against the idea.
> >
> > On a separate note, if we add a dependency on SCHED_MC for SCHED_CACHE,
> > we can simply look at cpu_coregroup_mask() and either allocate a new
> > llc_id / borrow llc id in sched_cpu_activate() when CPU is onlined or
> > reassign them in sched_cpu_deactivate() if an entire LLC is offlined.
> >
>
> Prateek, may I know if you are thinking of updating every CPU's LLC id
> during its hotplug and not update all percpu LLC id in
> build_sched_domains()?
>
> thanks,
> Chenyu
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-18 3:28 ` K Prateek Nayak
2026-02-18 15:22 ` Chen, Yu C
@ 2026-02-18 21:33 ` Tim Chen
1 sibling, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-18 21:33 UTC (permalink / raw)
To: K Prateek Nayak, Chen, Yu C
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On Wed, 2026-02-18 at 08:58 +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> On 2/18/2026 4:42 AM, Tim Chen wrote:
> > On Tue, 2026-02-17 at 13:39 +0530, K Prateek Nayak wrote:
> > > Hello Chenyu,
> > >
> > >
> >
> > [...snip...]
> >
> >
> > > > > > */
> > > > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > > > > > DEFINE_PER_CPU(int, sd_llc_size);
> > > > > > -DEFINE_PER_CPU(int, sd_llc_id);
> > > > > > +DEFINE_PER_CPU(int, sd_llc_id) = -1;
> > > > > > DEFINE_PER_CPU(int, sd_share_id);
> > > > > > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > > > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > > > > > @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
> > > > > > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > > > > > per_cpu(sd_llc_size, cpu) = size;
> > > > > > - per_cpu(sd_llc_id, cpu) = id;
> > > > > > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> > > > > > sd = lowest_flag_domain(cpu, SD_CLUSTER);
> > > > > > @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > > > > > /* Set up domains for CPUs specified by the cpu_map: */
> > > > > > for_each_cpu(i, cpu_map) {
> > > > > > - struct sched_domain_topology_level *tl;
> > > > > > + struct sched_domain_topology_level *tl, *tl_llc = NULL;
> > > > > > + int lid;
> > > > > > sd = NULL;
> > > > > > for_each_sd_topology(tl) {
> > > > > > + int flags = 0;
> > > > > > +
> > > > > > + if (tl->sd_flags)
> > > > > > + flags = (*tl->sd_flags)();
> > > > > > +
> > > > > > + if (flags & SD_SHARE_LLC)
> > > > > > + tl_llc = tl;
> > > > >
> > > > > nit. This loop breaks out when sched_domain_span(sd) covers the entire
> > > > > cpu_map and it might have not reached the topmost SD_SHARE_LLC domain
> > > > > yet. Is that cause for any concern?
> > > > >
> > > >
> > > > Could you please elaborate a little more on this? If it covers the
> > > > entire cpu_map shouldn't it stop going up to its parent domain?
> > > > Do you mean, sd_llc_1 and its parent sd_llc_2 could cover the same cpu_map,
> > > > and we should let tl_llc be assigned to sd_llc_2 (with sd_llc_1 degenerated)?
> > >
> > > I'm not sure if this is technically possible but assume following
> > > topology:
> > >
> > > [ LLC: 8-15 ]
> > > [ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
> > >
> > > and the following series of events:
> > >
> > > o All CPUs in LLC are offline to begin with (maxcpus = 1 like scenario).
> > >
> > > o CPUs 10-15 are onlined first.
> > >
> > > o CPU8 is put in a separate root partition and brought online.
> > > (XXX: I'm not 100% sure if this is possible in this order)
> > >
> > > o build_sched_domains() will bail out at SMT domain since the cpumap
> > > is covered by tl->mask() and tl_llc = tl_smt.
> > >
> > > o llc_id calculation uses the tl_smt->mask() which will not contain
> > > CPUs 10-15 and CPU8 will get a unique LLC id even though there are
> > > other online CPUs in the LLC with a different llc_id (!!!)
> > >
> > >
> > > Instead, if we traversed to tl_mc, we would have seen all the online
> > > CPUs in the MC and reused the llc_id from them. Might not be an issue on
> > > its own but if this root partition is removed later, CPU8 will continue
> > > to have the unique llc_id even after merging into the same MC domain.
> >
> > There is really no reason to reuse the llc_id as far as cache aware scheduling
> > goes in its v3 revision (see my reply to Madadi on this patch).
>
> Even I don't mind having some holes in the llc_id space when CPUs are
> offlined but my major concern would be seeing an inconsistent state
> where CPUs in same MC domains end up with different llc_id when after
> a bunch of hotplug activity.
>
> >
> > I am thinking that if we just simply rebuild LLC id across sched domain
> > rebuilds, that is probably the cleanest solution. There could be some races
> > in cpus_share_cache() as llc_id gets reassigned for some CPUs when they
> > come online/offline. But we also have similar races in the current mainline code.
> > Worst it can do is some temporary sub-optimal scheduling task placement.
> >
> > Thoughts?
>
> If you are suggesting populating the sd_llc_id for all the CPUs on
> topology rebuild, I'm not entirely against the idea.
>
> On a separate note, if we add a dependency on SCHED_MC for SCHED_CACHE,
> we can simply look at cpu_coregroup_mask() and either allocate a new
> llc_id / borrow llc id in sched_cpu_activate() when CPU is onlined or
> reassign them in sched_cpu_deactivate() if an entire LLC is offlined.
I also think that cpu_coregroup_mask() is a better choice than
tl->mask for getting the mask of CPUs in the LLC.
Okay, we'll consider an implementation along the lines of your suggested
__sched_domains_alloc_llc_id() to reuse an llc_id when all CPUs
in an LLC deactivate. That will minimize holes in LLC ids while
avoiding races in cpus_share_cache().
Thanks.
Tim
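The alloc/free scheme Tim agrees to here can be modeled roughly as follows (a toy Python model, not kernel code; the shrink step collapses past all trailing free ids, slightly more aggressively than the single decrement in the draft patch below):

```python
class LlcIdAllocator:
    """Toy model of a bitmap-style llc_id allocator with reuse."""

    def __init__(self):
        self.used = set()    # stands in for sched_domains_llc_id_allocmask
        self.max_llcs = 0    # stands in for tl_max_llcs

    def alloc(self):
        lid = 0
        while lid in self.used:          # like cpumask_first_zero()
            lid += 1
        self.used.add(lid)
        self.max_llcs = max(self.max_llcs, lid + 1)
        return lid

    def free(self, lid):
        self.used.discard(lid)
        # Shrink the id-space high-water mark when the top ids free up.
        while self.max_llcs and (self.max_llcs - 1) not in self.used:
            self.max_llcs -= 1
```

Freed ids are handed out again before any new id is minted, which keeps the id space dense and avoids unbounded growth across hotplug cycles.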
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-18 17:54 ` Madadi Vineeth Reddy
@ 2026-02-18 21:44 ` Tim Chen
2026-02-19 2:28 ` Madadi Vineeth Reddy
2026-02-19 16:52 ` Peter Zijlstra
2026-02-19 16:55 ` Peter Zijlstra
2 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-18 21:44 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Wed, 2026-02-18 at 23:24 +0530, Madadi Vineeth Reddy wrote:
> On 11/02/26 03:48, Tim Chen wrote:
> > From: Chen Yu <yu.c.chen@intel.com>
> >
> >
[ .. snip ..]
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d1145997b88d..86b6b08e7e1e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
> > return valid_llc_id(id);
> > }
> >
> > +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
> > +{
> > + int smt_nr = 1;
> > +
> > +#ifdef CONFIG_SCHED_SMT
> > + if (sched_smt_active())
> > + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
> > +#endif
> > +
> > + return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
> > + per_cpu(sd_llc_size, cpu));
>
>
> On Power10/Power11 with SMT4 and LLC size of 4, this check
> effectively disables cache-aware scheduling for any process.
There are 4 cores per LLC, with 4 SMT threads per core? In that case, once
we have more than 4 running threads and there's another idle LLC
available, it seems like putting the additional thread on a different
LLC is the right thing to do, as threads sharing a core will usually
be much slower.
But when the number of threads is under 4, we should still be
doing aggregation.
Perhaps I am misunderstanding your topology.
Tim
>
> I raised this point in v1 as well. Increasing the threshold
> doesn't seem like a viable solution either, as that would regress
> hackbench/ebizzy.
>
> Is there a way to make this useful for architectures with small LLC
> sizes? One possible approach we were exploring is to have LLC at a
> hemisphere level that comprise multiple SMT4 cores.
>
> Thanks,
> Vineeth
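For concreteness, the check under discussion can be modeled numerically, assuming the kernel's fits_capacity() margin of roughly 80% (cap * 1280 < max * 1024); the helper names mirror the patch but this is an illustrative sketch, not the kernel code:

```python
def fits_capacity(cap, mx):
    # kernel-style ~80% headroom check: cap * 1280 < max * 1024
    return cap * 1280 < mx * 1024

def exceed_llc_nr(nr_running_avg, smt_nr, llc_size):
    # model of the exceed_llc_nr() check in the quoted patch
    return not fits_capacity(nr_running_avg * smt_nr, llc_size)
```

With Vineeth's Power10 numbers (smt_nr = 4, llc_size = 4), even a single running thread trips the check (4 * 1280 >= 4 * 1024), so cache-aware scheduling is always disabled; a wider LLC (say 8 CPUs at SMT2) tolerates a few threads before exceeding.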
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-18 17:46 ` K Prateek Nayak
@ 2026-02-18 23:21 ` Tim Chen
2026-02-19 6:12 ` K Prateek Nayak
2026-02-19 11:25 ` Chen, Yu C
0 siblings, 2 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-18 23:21 UTC (permalink / raw)
To: K Prateek Nayak, Chen, Yu C
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On Wed, 2026-02-18 at 23:16 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 2/18/2026 8:52 PM, Chen, Yu C wrote:
> > > On a separate note, if we add a dependency on SCHED_MC for SCHED_CACHE,
> > > we can simply look at cpu_coregroup_mask() and either allocate a new
> > > llc_id / borrow llc id in sched_cpu_activate() when CPU is onlined or
> > > reassign them in sched_cpu_deactivate() if an entire LLC is offlined.
> > >
> >
> > Prateek, may I know if you are thinking of updating every CPU's LLC id
> > during its hotplug and not update all percpu LLC id in build_sched_domains()?
>
> I was still thinking of build_sched_domains() (or somewhere in the
> online and offline path) where we can first simply look at
> cpu_coregroup_mask() and decide if we need to traverse all CPUs and
> shuffle the IDs.
Pratek,
How about modifying the patch like the following, stealing
a lot of your code? I also added a change to shrink the max LLC count
when the LLC with the max id loses its last CPU.
Thanks.
Tim
---
diff --git a/init/Kconfig b/init/Kconfig
index 9848de949afa..4ddf54ab9cf7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -987,6 +987,7 @@ config SCHED_CACHE
bool "Cache aware load balance"
default y
depends on SMP
+ depends on SCHED_MC
help
When enabled, the scheduler will attempt to aggregate tasks from
the same process onto a single Last Level Cache (LLC) domain when
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48626c81ba8e..75ba4e0bfcd3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8474,6 +8474,8 @@ int sched_cpu_deactivate(unsigned int cpu)
*/
synchronize_rcu();
+ sched_domains_free_llc_id(cpu);
+
sched_set_rq_offline(rq, cpu);
scx_rq_deactivate(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6cbc56e9adfc..04f42526e6f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3862,6 +3862,7 @@ static inline bool sched_cache_enabled(void)
extern void sched_cache_active_set_unlocked(void);
#endif
extern void init_sched_mm(struct task_struct *p);
+void sched_domains_free_llc_id(int cpu);
extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 580fb2fbc900..5e59340ad9a9 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -18,6 +18,7 @@ void sched_domains_mutex_unlock(void)
}
/* Protected by sched_domains_mutex: */
+static cpumask_var_t sched_domains_llc_id_allocmask;
static cpumask_var_t sched_domains_tmpmask;
static cpumask_var_t sched_domains_tmpmask2;
static int tl_max_llcs;
@@ -2590,6 +2591,57 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
return true;
}
+static int __sched_domains_alloc_llc_id(void)
+{
+ int lid;
+
+ lockdep_assert_held(&sched_domains_mutex);
+
+ lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
+ if (lid >= tl_max_llcs)
+ tl_max_llcs = lid + 1;
+
+ /*
+ * llc_id space should never grow larger than the
+ * possible number of CPUs in the system.
+ */
+ if (!unlikely(WARN_ON_ONCE(lid >= nr_cpumask_bits)))
+ cpumask_set_cpu(lid, sched_domains_llc_id_allocmask);
+ return lid;
+}
+
+static void __sched_domains_free_llc_id(int cpu)
+{
+ int i, lid;
+
+ lockdep_assert_held(&sched_domains_mutex);
+
+ lid = per_cpu(sd_llc_id, cpu);
+ if (lid == -1)
+ return;
+
+ per_cpu(sd_llc_id, cpu) = -1;
+
+ for_each_online_cpu(i) {
+ /* An online CPU owns the llc_id. */
+ if (per_cpu(sd_llc_id, i) == lid)
+ return;
+ }
+
+ cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
+
+ /* shrink max LLC size to save memory */
+ if (lid == tl_max_llcs - 1)
+ lid = tl_max_llcs--;
+}
+
+void sched_domains_free_llc_id(int cpu)
+{
+ sched_domains_mutex_lock();
+ __sched_domains_free_llc_id(cpu);
+ sched_domains_mutex_unlock();
+}
+
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@@ -2615,18 +2667,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Set up domains for CPUs specified by the cpu_map: */
for_each_cpu(i, cpu_map) {
- struct sched_domain_topology_level *tl, *tl_llc = NULL;
+ struct sched_domain_topology_level *tl;
int lid;
sd = NULL;
for_each_sd_topology(tl) {
- int flags = 0;
-
- if (tl->sd_flags)
- flags = (*tl->sd_flags)();
-
- if (flags & SD_SHARE_LLC)
- tl_llc = tl;
sd = build_sched_domain(tl, cpu_map, attr, sd, i);
@@ -2642,18 +2687,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
if (lid == -1) {
int j;
+ j = cpumask_first(cpu_coregroup_mask(i));
/*
* Assign the llc_id to the CPUs that do not
* have an LLC.
*/
- if (!tl_llc) {
- per_cpu(sd_llc_id, i) = tl_max_llcs++;
+ if (j >= nr_cpu_ids) {
+ per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
continue;
}
/* try to reuse the llc_id of its siblings */
- for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
+ for (; j < nr_cpu_ids; j = cpumask_next(j, cpu_coregroup_mask(i))) {
if (i == j)
continue;
@@ -2668,7 +2714,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* a new LLC is detected */
if (lid == -1)
- per_cpu(sd_llc_id, i) = tl_max_llcs++;
+ per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
}
}
@@ -2869,6 +2915,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
{
int err;
+ zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
^ permalink raw reply related [flat|nested] 117+ messages in thread
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-18 21:44 ` Tim Chen
@ 2026-02-19 2:28 ` Madadi Vineeth Reddy
2026-02-19 14:38 ` Chen, Yu C
2026-02-19 21:12 ` Tim Chen
0 siblings, 2 replies; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-19 2:28 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Madadi Vineeth Reddy
On 19/02/26 03:14, Tim Chen wrote:
> On Wed, 2026-02-18 at 23:24 +0530, Madadi Vineeth Reddy wrote:
>> On 11/02/26 03:48, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@intel.com>
>>>
>>>
> [ .. snip ..]
>
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d1145997b88d..86b6b08e7e1e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
>>> return valid_llc_id(id);
>>> }
>>>
>>> +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>>> +{
>>> + int smt_nr = 1;
>>> +
>>> +#ifdef CONFIG_SCHED_SMT
>>> + if (sched_smt_active())
>>> + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>>> +#endif
>>> +
>>> + return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
>>> + per_cpu(sd_llc_size, cpu));
>>
>>
>> On Power10/Power11 with SMT4 and LLC size of 4, this check
>> effectively disables cache-aware scheduling for any process.
>
> There are 4 cores per LLC, with 4 SMT per core? In that case, once we have more than
> 4 running threads and there's another idle LLC available, seems
> like putting the additional thread on a different LLC is the
> right thing to do as threads sharing a core will usually be much
> slower.
>
> But when number of threads are under 4, we should still be
> doing aggregation.
>
> Perhaps I am misunderstanding your topology.
There is only one core per LLC whose size is 4 CPUs.
So, mm->sc_stat.nr_running_avg can't be >= 1 for
cache aware scheduling to be enabled.
Thanks,
Vineeth
>
> Tim
>
>>
>> I raised this point in v1 as well. Increasing the threshold
>> doesn't seem like a viable solution either, as that would regress
>> hackbench/ebizzy.
>>
>> Is there a way to make this useful for architectures with small LLC
>> sizes? One possible approach we were exploring is to have LLC at a
>> hemisphere level that comprise multiple SMT4 cores.
>>
>> Thanks,
>> Vineeth
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-18 23:21 ` Tim Chen
@ 2026-02-19 6:12 ` K Prateek Nayak
2026-02-19 15:51 ` Peter Zijlstra
2026-02-20 0:11 ` Tim Chen
2026-02-19 11:25 ` Chen, Yu C
1 sibling, 2 replies; 117+ messages in thread
From: K Prateek Nayak @ 2026-02-19 6:12 UTC (permalink / raw)
To: Tim Chen, Chen, Yu C
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
Hello Tim,
Thank you for the patch.
On 2/19/2026 4:51 AM, Tim Chen wrote:
> diff --git a/init/Kconfig b/init/Kconfig
> index 9848de949afa..4ddf54ab9cf7 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -987,6 +987,7 @@ config SCHED_CACHE
> bool "Cache aware load balance"
> default y
> depends on SMP
> + depends on SCHED_MC
> help
> When enabled, the scheduler will attempt to aggregate tasks from
> the same process onto a single Last Level Cache (LLC) domain when
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 48626c81ba8e..75ba4e0bfcd3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8474,6 +8474,8 @@ int sched_cpu_deactivate(unsigned int cpu)
> */
> synchronize_rcu();
>
> + sched_domains_free_llc_id(cpu);
> +
> sched_set_rq_offline(rq, cpu);
>
> scx_rq_deactivate(rq);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6cbc56e9adfc..04f42526e6f0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3862,6 +3862,7 @@ static inline bool sched_cache_enabled(void)
> extern void sched_cache_active_set_unlocked(void);
> #endif
> extern void init_sched_mm(struct task_struct *p);
> +void sched_domains_free_llc_id(int cpu);
>
> extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 580fb2fbc900..5e59340ad9a9 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -18,6 +18,7 @@ void sched_domains_mutex_unlock(void)
> }
>
> /* Protected by sched_domains_mutex: */
> +static cpumask_var_t sched_domains_llc_id_allocmask;
> static cpumask_var_t sched_domains_tmpmask;
> static cpumask_var_t sched_domains_tmpmask2;
> static int tl_max_llcs;
> @@ -2590,6 +2591,57 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> return true;
> }
>
> +static int __sched_domains_alloc_llc_id(void)
> +{
> + int lid;
> +
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
> + if (lid >= tl_max_llcs)
> + tl_max_llcs = lid + 1;
> +
> + /*
> + * llc_id space should never grow larger than the
> + * possible number of CPUs in the system.
> + */
> + if (!unlikely(WARN_ON_ONCE(lid >= nr_cpumask_bits)))
> + cpumask_set_cpu(lid, sched_domains_llc_id_allocmask);
> + return lid;
> +}
> +
> +static void __sched_domains_free_llc_id(int cpu)
> +{
> + int i, lid;
> +
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + lid = per_cpu(sd_llc_id, cpu);
> + if (lid == -1)
> + return;
> +
> + per_cpu(sd_llc_id, cpu) = -1;
> +
> + for_each_online_cpu(i) {
> + /* An online CPU owns the llc_id. */
> + if (per_cpu(sd_llc_id, i) == lid)
> + return;
> + }
We should perhaps warn and skip clearing lid from cpumask if lid was
found to be larger than "nr_cpumask_bits". Shouldn't happen but just
as a precaution.
> +
> + cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
> +
> + /* shrink max LLC size to save memory */
> + if (lid == tl_max_llcs - 1)
> + lid = tl_max_llcs--;
No need to assign the local "lid" variable here; Simple decrement
should do.
> +}
> +
> +void sched_domains_free_llc_id(int cpu)
> +{
> + sched_domains_mutex_lock();
> + __sched_domains_free_llc_id(cpu);
> + sched_domains_mutex_unlock();
> +}
> +
> /*
> * Build sched domains for a given set of CPUs and attach the sched domains
> * to the individual CPUs
> @@ -2615,18 +2667,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>
> /* Set up domains for CPUs specified by the cpu_map: */
> for_each_cpu(i, cpu_map) {
> - struct sched_domain_topology_level *tl, *tl_llc = NULL;
> + struct sched_domain_topology_level *tl;
> int lid;
>
> sd = NULL;
> for_each_sd_topology(tl) {
> - int flags = 0;
> -
> - if (tl->sd_flags)
> - flags = (*tl->sd_flags)();
> -
> - if (flags & SD_SHARE_LLC)
> - tl_llc = tl;
>
> sd = build_sched_domain(tl, cpu_map, attr, sd, i);
>
> @@ -2642,18 +2687,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> if (lid == -1) {
> int j;
>
> + j = cpumask_first(cpu_coregroup_mask(i));
> /*
> * Assign the llc_id to the CPUs that do not
> * have an LLC.
> */
> - if (!tl_llc) {
> - per_cpu(sd_llc_id, i) = tl_max_llcs++;
> + if (j >= nr_cpu_ids) {
> + per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
>
> continue;
> }
I don't think we need to special case this out since:
for_each_cpu(j, cpu_coregroup_mask(i)) {
...
}
would bail out if no CPU is set (also CPU "i" would definitely be
set on it since it must be online) and the "if" after the loop will
see "lid" as "-1" and DTRT.
>
> /* try to reuse the llc_id of its siblings */
> - for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
> + for (; j < nr_cpu_ids; j = cpumask_next(j, cpu_coregroup_mask(i))) {
> if (i == j)
> continue;
>
> @@ -2668,7 +2714,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>
> /* a new LLC is detected */
> if (lid == -1)
> - per_cpu(sd_llc_id, i) = tl_max_llcs++;
> + per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
> }
> }
>
> @@ -2869,6 +2915,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
> {
> int err;
>
> + zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
> zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
> zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
> zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-18 23:21 ` Tim Chen
2026-02-19 6:12 ` K Prateek Nayak
@ 2026-02-19 11:25 ` Chen, Yu C
2026-02-19 16:10 ` K Prateek Nayak
1 sibling, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-19 11:25 UTC (permalink / raw)
To: Tim Chen, K Prateek Nayak
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On 2/19/2026 7:21 AM, Tim Chen wrote:
> On Wed, 2026-02-18 at 23:16 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 2/18/2026 8:52 PM, Chen, Yu C wrote:
[ ... ]
>
> Prateek,
>
> How about modifying the patch like the following, stealing
> a lot of your code? I also added a change to shrink the max
> LLC count when the LLC with the highest id loses its last CPU.
>
> Thanks.
>
[ ... ]
> +
> +static void __sched_domains_free_llc_id(int cpu)
> +{
> + int i, lid;
> +
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + lid = per_cpu(sd_llc_id, cpu);
> + if (lid == -1)
> + return;
> +
> + per_cpu(sd_llc_id, cpu) = -1;
> +
> + for_each_online_cpu(i) {
One minor question: should we only iterate through
cpu_coregroup_mask(cpu) to check if any sibling CPU
within this LLC owns the llc_id? If there are no online
CPUs within this LLC, I assume we should release this
llc_id.
thanks,
Chenyu
> + /* An online CPU owns the llc_id. */
> + if (per_cpu(sd_llc_id, i) == lid)
> + return;
> + }
> +
> + cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
* Re: [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy
2026-02-10 22:18 ` [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
2026-02-14 16:12 ` Madadi Vineeth Reddy
@ 2026-02-19 11:29 ` Peter Zijlstra
2026-02-19 14:48 ` Chen, Yu C
1 sibling, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 11:29 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Tue, Feb 10, 2026 at 02:18:43PM -0800, Tim Chen wrote:
> +/*
> + * The margin used when comparing utilization.
> + * Is 'util1' noticeably greater than 'util2'?
> + * Derived from capacity_greater().
> + * Bias is in percentage.
> + */
> +/* Allows dst util to be bigger than src util by up to bias percent */
> +#define util_greater(util1, util2) \
> + ((util1) * 100 > (util2) * 120)
> + * 20% is the utilization imbalance percentage to decide
> + * if the preferred LLC is busier than the non-preferred LLC.
> + * 20 is a little higher than the LLC domain's imbalance_pct
> + * 17. The hysteresis is used to avoid task bouncing between the
> + * preferred LLC and the non-preferred LLC.
So not saying this needs changing now, but consider that imbalance_pct
can be changed through debugfs. Do you want this 120 to be expressed in
terms of imbalance_pct rather than being hardcoded?
Anyway, let me read on..
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-10 22:18 ` [PATCH v3 04/21] sched/cache: Make LLC id continuous Tim Chen
2026-02-14 17:53 ` Madadi Vineeth Reddy
2026-02-16 7:44 ` K Prateek Nayak
@ 2026-02-19 11:35 ` Peter Zijlstra
2026-02-19 18:17 ` Tim Chen
2026-02-19 14:59 ` Peter Zijlstra
3 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 11:35 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Tue, Feb 10, 2026 at 02:18:44PM -0800, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
>
> Introduce an index mapping between CPUs and their LLCs. This provides
> a continuous per LLC index needed for cache-aware load balancing in
> later patches.
>
> The existing per_cpu llc_id usually points to the first CPU of the
> LLC domain, which is sparse and unsuitable as an array index. Using
> llc_id directly would waste memory.
>
> With the new mapping, CPUs in the same LLC share a continuous id:
>
> per_cpu(llc_id, CPU=0...15) = 0
> per_cpu(llc_id, CPU=16...31) = 1
> per_cpu(llc_id, CPU=32...47) = 2
> ...
>
> Once a CPU has been assigned an llc_id, this ID persists even when
> the CPU is taken offline and brought back online, which can facilitate
> the management of the ID.
>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Note that Tim is the one sending this email, so his SOB should be last.
It is also fine to have a SOB occur multiple times in a chain.
Please double check all these SOB chains, because I think this isn't the
first one that isn't right (possibly the very first patch already has
problems).
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
` (20 preceding siblings ...)
2026-02-10 22:19 ` [PATCH v3 21/21] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics Tim Chen
@ 2026-02-19 14:08 ` Qais Yousef
2026-02-19 14:41 ` Peter Zijlstra
21 siblings, 1 reply; 117+ messages in thread
From: Qais Yousef @ 2026-02-19 14:08 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Libo Chen, linux-kernel
On 02/10/26 14:18, Tim Chen wrote:
> This patch series introduces infrastructure for cache-aware load
> balancing, with the goal of co-locating tasks that share data within
> the same Last Level Cache (LLC) domain. By improving cache locality,
> the scheduler can reduce cache bouncing and cache misses, ultimately
> improving data access efficiency. The design builds on the initial
> prototype from Peter [1].
>
> This initial implementation treats threads within the same process as
> entities that are likely to share data. During load balancing, the
This is a very aggressive assumption. From what I've seen, only a few tasks truly
share data. Lumping everything in a process together is an easy way to
classify, but I think we can do better.
> scheduler attempts to aggregate such threads onto the same LLC domain
> whenever possible.
I admit I have yet to look fully at the series. But I must ask, why are you
deferring to load balance and not looking at the wakeup path? LB should be for
corrections. When the wakeup path is making the wrong decision all the time, isn't
LB (which is super slow to react) too late to start grouping tasks? What am I missing?
In my head Core Scheduling is already doing what we want. We just need to
extend it to be a bit more relaxed (best effort rather than completely strict
for security reasons today). This will be a lot more flexible and will allow
tasks to be co-located from the get-go. And it will defer the responsibility of
tagging to userspace. If they do better or worse, it's on them :) It seems you
already hit a corner case where the grouping was a bad idea and are doing some
magic with thread numbers to alleviate it.
FWIW I have come across cases in the mobile world where co-locating on a cluster or
a 'big' core with a big L2 cache can benefit a small group of tasks. So the
concept is generally beneficial, as cache hierarchies are not symmetrical in
more and more systems now. Even on symmetrical systems, a case can be made where
two small data-dependent tasks can benefit from packing on a single CPU.
I know this changes the direction being made here; but I strongly believe the
right way is to extend wake up path rather than lump it solely in LB (IIUC).
Note I am looking at NETLINK to enable our proposed Sched QoS library to listen
to critical events like a process being created and tasks being forked to
auto-tag them. Userspace would easily be able to tag individual tasks as
co-dependent or ask for a whole process to be tagged as such (assign the same
cookie to all forked tasks for that process). We should not need to do any
magic in the kernel then other than provide the mechanisms to shoot themselves
in the foot (or do better ;-))
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-19 2:28 ` Madadi Vineeth Reddy
@ 2026-02-19 14:38 ` Chen, Yu C
2026-02-19 21:12 ` Tim Chen
1 sibling, 0 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-19 14:38 UTC (permalink / raw)
To: Madadi Vineeth Reddy, Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel
Hi Vineeth,
On 2/19/2026 10:28 AM, Madadi Vineeth Reddy wrote:
> On 19/02/26 03:14, Tim Chen wrote:
>> On Wed, 2026-02-18 at 23:24 +0530, Madadi Vineeth Reddy wrote:
>>> On 11/02/26 03:48, Tim Chen wrote:
>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>
>>>>
>> [ .. snip ..]
>>
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index d1145997b88d..86b6b08e7e1e 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
>>>> return valid_llc_id(id);
>>>> }
>>>>
>>>> +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>>>> +{
>>>> + int smt_nr = 1;
>>>> +
>>>> +#ifdef CONFIG_SCHED_SMT
>>>> + if (sched_smt_active())
>>>> + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>>>> +#endif
>>>> +
>>>> + return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
>>>> + per_cpu(sd_llc_size, cpu));
>>>
>>>
>>> On Power10/Power11 with SMT4 and LLC size of 4, this check
>>> effectively disables cache-aware scheduling for any process.
>>
>> There are 4 cores per LLC, with 4 SMT per core? In that case, once we have more than
>> 4 running threads and there's another idle LLC available, seems
>> like putting the additional thread on a different LLC is the
>> right thing to do as threads sharing a core will usually be much
>> slower.
>>
>> But when number of threads are under 4, we should still be
>> doing aggregation.
>>
>> Perhaps I am misunderstanding your topology.
>
> There is only one core per LLC whose size is 4 CPUs.
> So, mm->sc_stat.nr_running_avg can't be >= 1 for
> cache aware scheduling to be enabled.
>
There is a scale factor in the final step that can be tuned from
user space:
exceeded = !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
(scale * per_cpu(sd_llc_size, cpu)));
So if the user increases the llc_aggr_tolerance via debugfs,
cache aware aggregation can still be enabled. Or do you suggest
tuning the nr_running check and the RSS check via separate
debugfs knobs?
thanks,
Chenyu
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-19 14:08 ` [PATCH v3 00/21] Cache Aware Scheduling Qais Yousef
@ 2026-02-19 14:41 ` Peter Zijlstra
2026-02-19 15:07 ` Chen, Yu C
2026-02-19 19:48 ` Qais Yousef
0 siblings, 2 replies; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 14:41 UTC (permalink / raw)
To: Qais Yousef
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Libo Chen, linux-kernel
On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> On 02/10/26 14:18, Tim Chen wrote:
> > This patch series introduces infrastructure for cache-aware load
> > balancing, with the goal of co-locating tasks that share data within
> > the same Last Level Cache (LLC) domain. By improving cache locality,
> > the scheduler can reduce cache bouncing and cache misses, ultimately
> > improving data access efficiency. The design builds on the initial
> > prototype from Peter [1].
> >
> > This initial implementation treats threads within the same process as
> > entities that are likely to share data. During load balancing, the
>
> > This is a very aggressive assumption. From what I've seen, only a few tasks truly
> share data. Lumping everything in a process together is an easy way to
> classify, but I think we can do better.
Not without more information. And that is something we can always add
later. But like you well know, it is an uphill battle to get programs to
explain/annotate themselves.
The alternative is sampling things using the PMU, see which process is
trying to access which data, but that too is non-trivial, not to mention
it will get people really upset for consuming PMU resources.
Starting things with a simple assumption is fine. This can always be
extended. Gotta start somewhere and all that. It currently groups things
by mm_struct, but it would be fairly straight forward to allow userspace
to group tasks manually.
> > scheduler attempts to aggregate such threads onto the same LLC domain
> > whenever possible.
>
> I admit I have yet to look fully at the series. But I must ask, why are you
> deferring to load balance and not looking at the wakeup path? LB should be for
> corrections. When the wakeup path is making the wrong decision all the time, isn't
> LB (which is super slow to react) too late to start grouping tasks? What am I missing?
There used to be wakeup steering, but I'm not sure that still exists in
this version (still need to read beyond the first few patches). It isn't
hard to add.
But I think Tim and Chen have mostly been looking at 'enterprise'
workloads.
> In my head Core Scheduling is already doing what we want. We just need to
> extend it to be a bit more relaxed (best effort rather than completely strict
> for security reasons today). This will be a lot more flexible and will allow
> tasks to be co-located from the get-go. And it will defer the responsibility of
> tagging to userspace. If they do better or worse, it's on them :) It seems you
> already hit a corner case where the grouping was a bad idea and are doing some
> magic with thread numbers to alleviate it.
No, Core scheduling does completely the wrong thing. Core scheduling is
set up to do co-scheduling, because that's what was required for that
whole speculation trainwreck. And that is very much not what you want or
need here.
You simply want a preference to co-locate things that use the same data.
Which really is a completely different thing.
> FWIW I have come across cases in the mobile world where co-locating on a cluster or
> a 'big' core with a big L2 cache can benefit a small group of tasks. So the
> concept is generally beneficial, as cache hierarchies are not symmetrical in
> more and more systems now. Even on symmetrical systems, a case can be made where
> two small data-dependent tasks can benefit from packing on a single CPU.
Sure, we all know this. pipe-bench is a prime example, it flies if you
co-locate them on the same CPU. It tanks if you pull them apart (except
SMT siblings, those are mostly good too).
> I know this changes the direction being made here; but I strongly believe the
> right way is to extend wake up path rather than lump it solely in LB (IIUC).
You're really going to need both, and LB really is the more complicated
part. On a busy/loaded system, LB will completely wreck things for you
if it doesn't play ball.
* Re: [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy
2026-02-19 11:29 ` Peter Zijlstra
@ 2026-02-19 14:48 ` Chen, Yu C
2026-02-19 14:55 ` Peter Zijlstra
0 siblings, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-19 14:48 UTC (permalink / raw)
To: Peter Zijlstra, Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
Hi Peter,
On 2/19/2026 7:29 PM, Peter Zijlstra wrote:
> On Tue, Feb 10, 2026 at 02:18:43PM -0800, Tim Chen wrote:
>
>> +/*
>> + * The margin used when comparing utilization.
>> + * Is 'util1' noticeably greater than 'util2'?
>> + * Derived from capacity_greater().
>> + * Bias is in percentage.
>> + */
>> +/* Allows dst util to be bigger than src util by up to bias percent */
>> +#define util_greater(util1, util2) \
>> + ((util1) * 100 > (util2) * 120)
>
>> + * 20% is the utilization imbalance percentage to decide
>> + * if the preferred LLC is busier than the non-preferred LLC.
>> + * 20 is a little higher than the LLC domain's imbalance_pct
>> + * 17. The hysteresis is used to avoid task bouncing between the
>> + * preferred LLC and the non-preferred LLC.
>
> So not saying this needs changing now, but consider that imbalance_pct
> can be changed through debugfs. Do you want this 120 to be expressed in
> terms of imbalance_pct rather than being hardcoded?
Got it, I will look into adding a margin to imbalance_pct for this
comparison. With this change, I assume we can remove the llc_imb_pct
debugfs entry in patch 19.
thanks,
Chenyu
>
> Anyway, let me read on..
* Re: [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy
2026-02-19 14:48 ` Chen, Yu C
@ 2026-02-19 14:55 ` Peter Zijlstra
0 siblings, 0 replies; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 14:55 UTC (permalink / raw)
To: Chen, Yu C
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Thu, Feb 19, 2026 at 10:48:11PM +0800, Chen, Yu C wrote:
> Hi Peter,
>
> On 2/19/2026 7:29 PM, Peter Zijlstra wrote:
> > On Tue, Feb 10, 2026 at 02:18:43PM -0800, Tim Chen wrote:
> >
> > > +/*
> > > + * The margin used when comparing utilization.
> > > + * Is 'util1' noticeably greater than 'util2'?
> > > + * Derived from capacity_greater().
> > > + * Bias is in percentage.
> > > + */
> > > +/* Allows dst util to be bigger than src util by up to bias percent */
> > > +#define util_greater(util1, util2) \
> > > + ((util1) * 100 > (util2) * 120)
> >
> > > + * 20% is the utilization imbalance percentage to decide
> > > + * if the preferred LLC is busier than the non-preferred LLC.
> > > + * 20 is a little higher than the LLC domain's imbalance_pct
> > > + * 17. The hysteresis is used to avoid task bouncing between the
> > > + * preferred LLC and the non-preferred LLC.
> >
> > So not saying this needs changing now, but consider that imbalance_pct
> > can be changed through debugfs. Do you want this 120 to be expressed in
> > terms of imbalance_pct rather than being hardcoded?
>
> Got it, I will look into adding a margin to imbalance_pct for this
> comparison. With this change, I assume we can remove the llc_imb_pct
> debugfs entry in patch 19.
I hadn't gotten that far yet. Still trying to read patch 4 :-)
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-10 22:18 ` [PATCH v3 04/21] sched/cache: Make LLC id continuous Tim Chen
` (2 preceding siblings ...)
2026-02-19 11:35 ` Peter Zijlstra
@ 2026-02-19 14:59 ` Peter Zijlstra
2026-02-19 15:20 ` Chen, Yu C
3 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 14:59 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Tue, Feb 10, 2026 at 02:18:44PM -0800, Tim Chen wrote:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index cf643a5ddedd..ca46b5cf7f78 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
> /* Protected by sched_domains_mutex: */
> static cpumask_var_t sched_domains_tmpmask;
> static cpumask_var_t sched_domains_tmpmask2;
> +static int tl_max_llcs;
>
> static int __init sched_debug_setup(char *str)
> {
> @@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
> */
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> -DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
> DEFINE_PER_CPU(int, sd_share_id);
> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
>
> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> per_cpu(sd_llc_size, cpu) = size;
> - per_cpu(sd_llc_id, cpu) = id;
> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> sd = lowest_flag_domain(cpu, SD_CLUSTER);
> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>
> /* Set up domains for CPUs specified by the cpu_map: */
> for_each_cpu(i, cpu_map) {
> - struct sched_domain_topology_level *tl;
> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
> + int lid;
>
> sd = NULL;
> for_each_sd_topology(tl) {
> + int flags = 0;
> +
> + if (tl->sd_flags)
> + flags = (*tl->sd_flags)();
> +
> + if (flags & SD_SHARE_LLC)
> + tl_llc = tl;
>
> sd = build_sched_domain(tl, cpu_map, attr, sd, i);
>
> @@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> if (cpumask_equal(cpu_map, sched_domain_span(sd)))
> break;
> }
> +
> + lid = per_cpu(sd_llc_id, i);
> + if (lid == -1) {
> + int j;
> +
> + /*
> + * Assign the llc_id to the CPUs that do not
> + * have an LLC.
> + */
Where does this happen? Is this for things like Atom that don't have an
L3 and so we don't set up a LLC domain?
> + if (!tl_llc) {
> + per_cpu(sd_llc_id, i) = tl_max_llcs++;
> +
> + continue;
> + }
> +
> + /* try to reuse the llc_id of its siblings */
> + for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
> + if (i == j)
> + continue;
> +
> + lid = per_cpu(sd_llc_id, j);
> +
> + if (lid != -1) {
> + per_cpu(sd_llc_id, i) = lid;
> +
> + break;
> + }
> + }
> +
> + /* a new LLC is detected */
> + if (lid == -1)
> + per_cpu(sd_llc_id, i) = tl_max_llcs++;
> + }
> }
>
> if (WARN_ON(!topology_span_sane(cpu_map)))
> --
> 2.32.0
>
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-19 14:41 ` Peter Zijlstra
@ 2026-02-19 15:07 ` Chen, Yu C
2026-02-19 18:11 ` Tim Chen
2026-02-20 3:25 ` Qais Yousef
2026-02-19 19:48 ` Qais Yousef
1 sibling, 2 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-19 15:07 UTC (permalink / raw)
To: Peter Zijlstra, Qais Yousef
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Libo Chen, linux-kernel
Hi Peter, Qais,
On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
>> On 02/10/26 14:18, Tim Chen wrote:
[ ... ]
>>
>> I admit I have yet to look fully at the series. But I must ask: why are
>> you deferring to load balance and not looking at the wake up path? LB
>> should be for corrections. When the wake up path is making the wrong
>> decision all the time, isn't LB (which is super slow to react) too late
>> to start grouping tasks? What am I missing?
>
> There used to be wakeup steering, but I'm not sure that still exists in
> this version (still need to read beyond the first few patches). It isn't
> hard to add.
>
Please let me explain a little more about why we did this in the
load balance path. Yes, the original version implemented cache-aware
scheduling only in the wakeup path. In our testing, this caused task
bouncing across LLCs because it conflicted with the legacy load
balancer, which tries to spread tasks across LLCs.
So, as Peter said, the load balancer needs to be taken care of anyway.
Later, we kept the cache-aware logic only in the load balancer, and the
test results became much more stable, so we kept it as is. The wakeup
fast path already more or less aggregates the wakees (threads within
the same process) within the LLC, so we have not changed it for now.
Let me copy the changelog from the previous patch version:
"
In previous versions, aggregation of tasks was done in the
wake up path, without making load balancing paths aware of
LLC (Last-Level-Cache) preference. This led to the following
problems:
1) Aggregation of tasks during wake up led to load imbalance
between LLCs
2) Load balancing tried to even out the load between LLCs
3) Wake up tasks aggregation happened at a faster rate and
load balancing moved tasks in opposite directions, leading
to continuous and excessive task migrations and regressions
in benchmarks like schbench.
In this version, load balancing is made cache-aware. The main
idea of cache-aware load balancing consists of two parts:
1) Identify tasks that prefer to run on their hottest LLC and
move them there.
2) Prevent generic load balancing from moving a task out of
its hottest LLC.
"
thanks,
Chenyu
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 14:59 ` Peter Zijlstra
@ 2026-02-19 15:20 ` Chen, Yu C
2026-02-19 19:20 ` Tim Chen
0 siblings, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-19 15:20 UTC (permalink / raw)
To: Peter Zijlstra, Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On 2/19/2026 10:59 PM, Peter Zijlstra wrote:
> On Tue, Feb 10, 2026 at 02:18:44PM -0800, Tim Chen wrote:
>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index cf643a5ddedd..ca46b5cf7f78 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
>> /* Protected by sched_domains_mutex: */
>> static cpumask_var_t sched_domains_tmpmask;
>> static cpumask_var_t sched_domains_tmpmask2;
>> +static int tl_max_llcs;
>>
>> static int __init sched_debug_setup(char *str)
>> {
>> @@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
>> */
>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>> DEFINE_PER_CPU(int, sd_llc_size);
>> -DEFINE_PER_CPU(int, sd_llc_id);
>> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
>> DEFINE_PER_CPU(int, sd_share_id);
>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>> @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
>>
>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>> per_cpu(sd_llc_size, cpu) = size;
>> - per_cpu(sd_llc_id, cpu) = id;
>> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>
>> sd = lowest_flag_domain(cpu, SD_CLUSTER);
>> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>
>> /* Set up domains for CPUs specified by the cpu_map: */
>> for_each_cpu(i, cpu_map) {
>> - struct sched_domain_topology_level *tl;
>> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
>> + int lid;
>>
>> sd = NULL;
>> for_each_sd_topology(tl) {
>> + int flags = 0;
>> +
>> + if (tl->sd_flags)
>> + flags = (*tl->sd_flags)();
>> +
>> + if (flags & SD_SHARE_LLC)
>> + tl_llc = tl;
>>
>> sd = build_sched_domain(tl, cpu_map, attr, sd, i);
>>
>> @@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> if (cpumask_equal(cpu_map, sched_domain_span(sd)))
>> break;
>> }
>> +
>> + lid = per_cpu(sd_llc_id, i);
>> + if (lid == -1) {
>> + int j;
>> +
>> + /*
>> + * Assign the llc_id to the CPUs that do not
>> + * have an LLC.
>> + */
>
> Where does this happen? Is this for things like Atom that don't have an
> L3 and so we don't set up a LLC domain?
>
Yes, on some hybrid platforms some CPUs might not have an L3;
Tim might correct me if I'm wrong. The code above is derived from
update_top_cache_domain(): if there is no sched domain with
SD_SHARE_LLC, per_cpu(sd_llc_id, cpu) is set to the CPU number
directly.
thanks,
Chenyu
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-16 7:44 ` K Prateek Nayak
2026-02-17 6:07 ` Chen, Yu C
@ 2026-02-19 15:40 ` Peter Zijlstra
2026-02-20 15:53 ` Chen, Yu C
1 sibling, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 15:40 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Tim Chen, Ingo Molnar, Gautham R . Shenoy, Vincent Guittot,
Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Mon, Feb 16, 2026 at 01:14:20PM +0530, K Prateek Nayak wrote:
> > @@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > if (cpumask_equal(cpu_map, sched_domain_span(sd)))
> > break;
> > }
> > +
> > + lid = per_cpu(sd_llc_id, i);
> > + if (lid == -1) {
> > + int j;
> > +
> > + /*
> > + * Assign the llc_id to the CPUs that do not
> > + * have an LLC.
> > + */
> > + if (!tl_llc) {
> > + per_cpu(sd_llc_id, i) = tl_max_llcs++;
> > +
> > + continue;
> > + }
> > +
> > + /* try to reuse the llc_id of its siblings */
> > + for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
>
>
> My only large concern that remains is the fact that offline CPUs are
> taken out of the tl->mask(), which can lead to interesting cases where
> CPUs on the same LLC can have different llc_id:
>
> o Boot with maxcpus=1
>
> o Run:
>
> for i in $(seq 1 $NRCPUS); do
> echo 1 > /sys/devices/system/cpu/cpu$i/online;
> echo 0 > /sys/devices/system/cpu/cpu$i/online;
> done
Lol, cute ;-)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c6efa71cf500..aee1be89ab4c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8268,6 +8268,8 @@ static void cpuset_cpu_active(void)
> static void cpuset_cpu_inactive(unsigned int cpu)
> {
> if (!cpuhp_tasks_frozen) {
> + /* XXX: Is this the right spot? */
> + sched_domains_free_llc_id(cpu);
> cpuset_update_active_cpus();
> } else {
> num_cpus_frozen++;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index de5b701c3950..31a8910297c7 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3903,6 +3903,7 @@ static inline bool sched_cache_enabled(void)
> }
> #endif
> extern void init_sched_mm(struct task_struct *p);
> +void sched_domains_free_llc_id(int cpu);
>
> extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ca46b5cf7f78..04c1ab489ee2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -18,6 +18,7 @@ void sched_domains_mutex_unlock(void)
> }
>
> /* Protected by sched_domains_mutex: */
> +static cpumask_var_t sched_domains_llc_id_allocmask;
> static cpumask_var_t sched_domains_tmpmask;
> static cpumask_var_t sched_domains_tmpmask2;
> static int tl_max_llcs;
> @@ -2543,6 +2544,53 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> return true;
> }
>
> +static int __sched_domains_alloc_llc_id(void)
> +{
> + int lid;
> +
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
> + if (lid >= tl_max_llcs)
> + tl_max_llcs++;
Urgh, should we not rather track the max lid?
Also, we allocate max_llc sized data structures; if this thing is
'variable' we must also always store a copy of the 'lid' size at the
time of allocation.
> +
> + /*
> + * llc_id space should never grow larger than the
> + * possible number of CPUs in the system.
> + */
> + if (!unlikely(WARN_ON_ONCE(lid >= nr_cpumask_bits)))
> + cpumask_set_cpu(lid, sched_domains_llc_id_allocmask);
__cpumask_set_cpu()
Since you're serializing everything with that sched_domains_mutex, this
need not be an atomic op.
> + return lid;
> +}
> +
> +static void __sched_domains_free_llc_id(int cpu)
> +{
> + int i, lid;
> +
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + lid = per_cpu(sd_llc_id, cpu);
> + if (lid == -1)
> + return;
> +
> + per_cpu(sd_llc_id, cpu) = -1;
> +
> + for_each_online_cpu(i) {
> + /* An online CPU owns the llc_id. */
> + if (per_cpu(sd_llc_id, i) == lid)
> + return;
> + }
> +
> + cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
__cpumask_clear_cpu()
> +}
So this deals with Madadi's issue I suppose.
> +void sched_domains_free_llc_id(int cpu)
> +{
> + sched_domains_mutex_lock();
> + __sched_domains_free_llc_id(cpu);
> + sched_domains_mutex_unlock();
> +}
> +
> /*
> * Build sched domains for a given set of CPUs and attach the sched domains
> * to the individual CPUs
> @@ -2599,7 +2647,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> * have an LLC.
> */
> if (!tl_llc) {
> - per_cpu(sd_llc_id, i) = tl_max_llcs++;
> + per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
>
> continue;
> }
> @@ -2620,7 +2668,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>
> /* a new LLC is detected */
> if (lid == -1)
> - per_cpu(sd_llc_id, i) = tl_max_llcs++;
> + per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
> }
> }
>
> @@ -2798,6 +2846,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
> {
> int err;
>
> + zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
> zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
> zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
> zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
> ---
>
> It doesn't compact tl_max_llcs, but it should promote reuse of llc_id if
> all CPUs of a LLC go offline. I know it is a ridiculous scenario but it
> is possible nonetheless.
>
> I'll let Peter and Valentin be the judge of additional space and
> complexity needed for these bits :-)
It appears straight forward enough I suppose.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-17 8:09 ` K Prateek Nayak
2026-02-17 23:12 ` Tim Chen
2026-02-18 15:11 ` Chen, Yu C
@ 2026-02-19 15:48 ` Peter Zijlstra
2026-02-20 15:22 ` Chen, Yu C
2 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 15:48 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Chen, Yu C, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Tim Chen,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On Tue, Feb 17, 2026 at 01:39:45PM +0530, K Prateek Nayak wrote:
> I'm not sure if this is technically possible but assume following
> topology:
>
> [ LLC: 8-15 ]
> [ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
>
> and the following series of events:
>
> o All CPUs in LLC are offline to begin with (maxcpus = 1 like scenario).
>
> o CPUs 10-15 are onlined first.
>
> o CPU8 is put in a separate root partition and brought online.
> (XXX: I'm not 100% sure if this is possible in this order)
>
> o build_sched_domains() will bail out at SMT domain since the cpumap
> is covered by tl->mask() and tl_llc = tl_smt.
>
> o llc_id calculation uses the tl_smt->mask() which will not contain
> CPUs 10-15 and CPU8 will get a unique LLC id even though there are
> other online CPUs in the LLC with a different llc_id (!!!)
Yeah, so partitions (including isol_cpus) could wreck things here, since
this is purely about the sched_domains.
You can create N single CPU partitions (isol_cpus does this) and end up
with the same 'problem' that online one at a time loop did. Except this
time it would not be 'wrong'. Since they are single CPU domains, you
also don't get load-balancing, so who cares I suppose. But it will
inflate max_lid.
But suppose you create N/2 partitions (where N is the number of CPUs in
the physical LLC), then you get many individual 'LLC's and
load-balancing inside them. I suppose this is correct, although it does
inflate max_lid somewhat beyond what you would normally expect.
However, most of that space would be wasted, since you're not actually
allowed to migrate to them.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 6:12 ` K Prateek Nayak
@ 2026-02-19 15:51 ` Peter Zijlstra
2026-02-20 0:11 ` Tim Chen
1 sibling, 0 replies; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 15:51 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Tim Chen, Chen, Yu C, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
Josh Don, Gavin Guo, Qais Yousef, Libo Chen, linux-kernel,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On Thu, Feb 19, 2026 at 11:42:58AM +0530, K Prateek Nayak wrote:
> I don't think we need to special case this out since:
>
> for_each_cpu(j, cpu_coregroup_mask(i)) {
> ...
> }
>
> would bail out if no CPU is set (also CPU "i" would definitely be
> set on it since it must be online) and the "if" after the loop will
> see "lid" as "-1" and DTRT.
So tying lid to coregroup_mask, rather than sched_domains might make
sense. It avoids that partitions nonsense.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 11:25 ` Chen, Yu C
@ 2026-02-19 16:10 ` K Prateek Nayak
0 siblings, 0 replies; 117+ messages in thread
From: K Prateek Nayak @ 2026-02-19 16:10 UTC (permalink / raw)
To: Chen, Yu C, Tim Chen
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
Hello Chenyu,
On 2/19/2026 4:55 PM, Chen, Yu C wrote:
>> +static void __sched_domains_free_llc_id(int cpu)
>> +{
>> + int i, lid;
>> +
>> + lockdep_assert_held(&sched_domains_mutex);
>> +
>> + lid = per_cpu(sd_llc_id, cpu);
>> + if (lid == -1)
>> + return;
>> +
>> + per_cpu(sd_llc_id, cpu) = -1;
>> +
>> + for_each_online_cpu(i) {
>
> One minor question: should we only iterate through
> cpu_coregroup_mask(cpu) to check if any sibling CPU
> within this LLC owns the llc_id? If there are no online
> CPUs within this LLC, I assume we should release this
> llc_id.
That should work too! I'm assuming the arch/ side
unlink happens before this in which case we can simply
check cpumask_empty(cpu_coregroup_mask(cpu)).
>
> thanks,
> Chenyu
>
>> + /* An online CPU owns the llc_id. */
>> + if (per_cpu(sd_llc_id, i) == lid)
>> + return;
>> + }
>> +
>> + cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-10 22:18 ` [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
2026-02-18 17:54 ` Madadi Vineeth Reddy
@ 2026-02-19 16:50 ` Peter Zijlstra
2026-02-19 21:06 ` Tim Chen
1 sibling, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 16:50 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Tue, Feb 10, 2026 at 02:18:55PM -0800, Tim Chen wrote:
> +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
> +{
> + int smt_nr = 1;
> +
> +#ifdef CONFIG_SCHED_SMT
> + if (sched_smt_active())
> + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
cpu_smt_num_threads ?
> +#endif
> +
> + return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
> + per_cpu(sd_llc_size, cpu));
> +}
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-18 17:54 ` Madadi Vineeth Reddy
2026-02-18 21:44 ` Tim Chen
@ 2026-02-19 16:52 ` Peter Zijlstra
2026-02-20 7:02 ` Madadi Vineeth Reddy
2026-02-19 16:55 ` Peter Zijlstra
2 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 16:52 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Wed, Feb 18, 2026 at 11:24:05PM +0530, Madadi Vineeth Reddy wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d1145997b88d..86b6b08e7e1e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
> > return valid_llc_id(id);
> > }
> >
> > +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
> > +{
> > + int smt_nr = 1;
> > +
> > +#ifdef CONFIG_SCHED_SMT
> > + if (sched_smt_active())
> > + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
> > +#endif
> > +
> > + return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
> > + per_cpu(sd_llc_size, cpu));
>
>
> On Power10/Power11 with SMT4 and LLC size of 4, this check
> effectively disables cache-aware scheduling for any process.
>
> I raised this point in v1 as well. Increasing the threshold
> doesn't seem like a viable solution either, as that would regress
> hackbench/ebizzy.
>
> Is there a way to make this useful for architectures with small LLC
> sizes? One possible approach we were exploring is to have LLC at a
> hemisphere level that comprises multiple SMT4 cores.
One way forward would be to use a llc-mask instead of a single llc value
for preference. I think this got mentioned before, and I think it makes
sense to do this later.
But once you can have a 'few' LLCs as preference, this constraint
becomes a little easier.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-18 17:54 ` Madadi Vineeth Reddy
2026-02-18 21:44 ` Tim Chen
2026-02-19 16:52 ` Peter Zijlstra
@ 2026-02-19 16:55 ` Peter Zijlstra
2026-02-20 6:40 ` Madadi Vineeth Reddy
2 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-19 16:55 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Wed, Feb 18, 2026 at 11:24:05PM +0530, Madadi Vineeth Reddy wrote:
> Is there a way to make this useful for architectures with small LLC
> sizes? One possible approach we were exploring is to have LLC at a
> hemisphere level that comprises multiple SMT4 cores.
Is this hemisphere an actual physical cache level, or would that be
artificial?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-19 15:07 ` Chen, Yu C
@ 2026-02-19 18:11 ` Tim Chen
2026-02-20 3:29 ` Qais Yousef
2026-02-20 3:25 ` Qais Yousef
1 sibling, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-19 18:11 UTC (permalink / raw)
To: Chen, Yu C, Peter Zijlstra, Qais Yousef
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Libo Chen, linux-kernel
On Thu, 2026-02-19 at 23:07 +0800, Chen, Yu C wrote:
> Hi Peter, Qais,
>
> On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > On 02/10/26 14:18, Tim Chen wrote:
>
> [ ... ]
>
> > >
> > > I admit I have yet to look fully at the series. But I must ask: why are you
> > > deferring to load balance and not looking at the wake up path? LB should be
> > > for corrections. When the wake up path is making the wrong decision all the
> > > time, isn't LB (which is super slow to react) too late to start grouping
> > > tasks? What am I missing?
> >
> > There used to be wakeup steering, but I'm not sure that still exists in
> > this version (still need to read beyond the first few patches). It isn't
> > hard to add.
> >
>
> Please let me explain a little more about why we did this in the
> load balance path. Yes, the original version implemented cache-aware
> scheduling only in the wakeup path. In our testing, this caused task
> bouncing across LLCs because it conflicted with the legacy load
> balancer, which tries to spread tasks across LLCs.
> So, as Peter said, the load balancer needs to be taken care of anyway.
> Later, we kept the cache-aware logic only in the load balancer, and the
> test results became much more stable, so we kept it as is. The wakeup
> fast path already more or less aggregates the wakees (threads within
> the same process) within the LLC, so we have not changed it for now.
>
> Let me copy the changelog from the previous patch version:
>
> "
> In previous versions, aggregation of tasks was done in the
> wake up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
>
> 1) Aggregation of tasks during wake up led to load imbalance
> between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake up tasks aggregation happened at a faster rate and
> load balancing moved tasks in opposite directions, leading
> to continuous and excessive task migrations and regressions
> in benchmarks like schbench.
>
> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
>
> 1) Identify tasks that prefer to run on their hottest LLC and
> move them there.
> 2) Prevent generic load balancing from moving a task out of
> its hottest LLC.
> "
>
Another reason why we moved away from doing things in the wake up
path is load imbalance. The wake up path does not have load
information for the LLC sched domains that is as up to date as what
the load balance path sees. So you may actually have every task rush
into its favorite LLC and overload it, and load balancing will then
have to undo this. This led to frequent task migrations that hurt
performance.
It is better to consider LLC preference in the load balance path
so we can aggregate tasks while still keeping load imbalance under
control.
Tim
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 11:35 ` Peter Zijlstra
@ 2026-02-19 18:17 ` Tim Chen
0 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-19 18:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Thu, 2026-02-19 at 12:35 +0100, Peter Zijlstra wrote:
> On Tue, Feb 10, 2026 at 02:18:44PM -0800, Tim Chen wrote:
> > From: Chen Yu <yu.c.chen@intel.com>
> >
> > Introduce an index mapping between CPUs and their LLCs. This provides
> > a continuous per LLC index needed for cache-aware load balancing in
> > later patches.
> >
> > The existing per_cpu llc_id usually points to the first CPU of the
> > LLC domain, which is sparse and unsuitable as an array index. Using
> > llc_id directly would waste memory.
> >
> > With the new mapping, CPUs in the same LLC share a continuous id:
> >
> > per_cpu(llc_id, CPU=0...15) = 0
> > per_cpu(llc_id, CPU=16...31) = 1
> > per_cpu(llc_id, CPU=32...47) = 2
> > ...
> >
> > Once a CPU has been assigned an llc_id, this ID persists even when
> > the CPU is taken offline and brought back online, which can facilitate
> > the management of the ID.
> >
> > Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>
> Note that Tim is the one sending this email, so his SOB should be last.
> It is also fine to have a SOB occur multiple times in a chain.
>
> Please double check all these SOB chains, because I think this isn't the
> first one that isn't right (possibly the very first patch already has
> problems).
Sorry about that. Will correct this on the next version.
Tim
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 15:20 ` Chen, Yu C
@ 2026-02-19 19:20 ` Tim Chen
2026-02-19 21:04 ` Tim Chen
0 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-19 19:20 UTC (permalink / raw)
To: Chen, Yu C, Peter Zijlstra
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Thu, 2026-02-19 at 23:20 +0800, Chen, Yu C wrote:
> On 2/19/2026 10:59 PM, Peter Zijlstra wrote:
> > On Tue, Feb 10, 2026 at 02:18:44PM -0800, Tim Chen wrote:
> >
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index cf643a5ddedd..ca46b5cf7f78 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
> > > /* Protected by sched_domains_mutex: */
> > > static cpumask_var_t sched_domains_tmpmask;
> > > static cpumask_var_t sched_domains_tmpmask2;
> > > +static int tl_max_llcs;
> > >
> > > static int __init sched_debug_setup(char *str)
> > > {
> > > @@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
> > > */
> > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > > DEFINE_PER_CPU(int, sd_llc_size);
> > > -DEFINE_PER_CPU(int, sd_llc_id);
> > > +DEFINE_PER_CPU(int, sd_llc_id) = -1;
> > > DEFINE_PER_CPU(int, sd_share_id);
> > > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > > @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
> > >
> > > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > > per_cpu(sd_llc_size, cpu) = size;
> > > - per_cpu(sd_llc_id, cpu) = id;
> > > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> > >
> > > sd = lowest_flag_domain(cpu, SD_CLUSTER);
> > > @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >
> > > /* Set up domains for CPUs specified by the cpu_map: */
> > > for_each_cpu(i, cpu_map) {
> > > - struct sched_domain_topology_level *tl;
> > > + struct sched_domain_topology_level *tl, *tl_llc = NULL;
> > > + int lid;
> > >
> > > sd = NULL;
> > > for_each_sd_topology(tl) {
> > > + int flags = 0;
> > > +
> > > + if (tl->sd_flags)
> > > + flags = (*tl->sd_flags)();
> > > +
> > > + if (flags & SD_SHARE_LLC)
> > > + tl_llc = tl;
> > >
> > > sd = build_sched_domain(tl, cpu_map, attr, sd, i);
> > >
> > > @@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > > if (cpumask_equal(cpu_map, sched_domain_span(sd)))
> > > break;
> > > }
> > > +
> > > + lid = per_cpu(sd_llc_id, i);
> > > + if (lid == -1) {
> > > + int j;
> > > +
> > > + /*
> > > + * Assign the llc_id to the CPUs that do not
> > > + * have an LLC.
> > > + */
> >
> > Where does this happen? Is this for things like Atom that don't have an
> > L3 and so we don't set up a LLC domain?
> >
>
> Yes, for some hybrid platforms, some CPUs on that platforms might not
> have L3,
> Tim might correct me if I’m wrong. Above code is derived from the
> update_top_cache_domain(),
> if there is no sched domain with SD_SHARE_LLC, per_cpu(sd_llc_id, cpu)
> is set to the
> CPU number directly.
>
That's correct. One example is Meteor Lake where some Atom CPUs don't have
L3 but have only L2. And some Ampere CPUs also have no shared L3.
https://www.spinics.net/lists/kernel/msg5863118.html
This also reminded me that if we rely on cpu_coregroup_mask for LLC id
assignment, we may miss platforms that need to treat L2 as the last
level cache. So we may need to fall back to cpu_clustergroup_mask
or cpu_smt_mask where applicable.
Tim
> thanks,
> Chenyu
>
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-19 14:41 ` Peter Zijlstra
2026-02-19 15:07 ` Chen, Yu C
@ 2026-02-19 19:48 ` Qais Yousef
2026-02-19 21:47 ` Tim Chen
1 sibling, 1 reply; 117+ messages in thread
From: Qais Yousef @ 2026-02-19 19:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Libo Chen, linux-kernel
On 02/19/26 15:41, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > On 02/10/26 14:18, Tim Chen wrote:
> > > This patch series introduces infrastructure for cache-aware load
> > > balancing, with the goal of co-locating tasks that share data within
> > > the same Last Level Cache (LLC) domain. By improving cache locality,
> > > the scheduler can reduce cache bouncing and cache misses, ultimately
> > > improving data access efficiency. The design builds on the initial
> > > prototype from Peter [1].
> > >
> > > This initial implementation treats threads within the same process as
> > > entities that are likely to share data. During load balancing, the
> >
> > This is a very aggressive assumption. From what I've seen, only few tasks truly
> > share data. Lumping everything in a process together is an easy way to
> > classify, but I think we can do better.
>
> Not without more information. And that is something we can always add
> later. But like you well know, it is an uphill battle to get programs to
> explain/annotate themselves.
Yes. I think we should be able to come up with a daemon that profiles
a workload on a machine and recommends groups of tasks that have data
co-dependency.
Note I am strongly against programs specifying this themselves. We need
to provide a service that helps with the correct tagging - i.e., it is
an admin-only operation.
>
> The alternative is sampling things using the PMU, see which process is
> trying to access which data, but that too is non-trivial, not to mention
> it will get people really upset for consuming PMU resources.
I was hoping we could tell with perf which data structures are shared
between tasks?
I am thinking this is not something that needs to run continuously, but
something discovered one-off on a machine, or once per update. The
profiling can be done once (on demand), I believe.
Still if someone really wants to tag all the tasks for a process to stay
together, I think this is fine if that's what they want.
>
> Starting things with a simple assumption is fine. This can always be
> extended. Gotta start somewhere and all that. It currently groups things
> by mm_struct, but it would be fairly straight forward to allow userspace
> to group tasks manually.
>
> > > scheduler attempts to aggregate such threads onto the same LLC domain
> > > whenever possible.
> >
> > I admit yet to look fully at the series. But I must ask, why are you deferring
> > to load balance and not looking at wake up path? LB should be for corrections.
> > When wake up path is doing wrong decision all the time, LB (which is super slow
> > to react) is too late to start grouping tasks? What am I missing?
>
> There used to be wakeup steering, but I'm not sure that still exists in
> this version (still need to read beyond the first few patches). It isn't
> hard to add.
>
> But I think Tim and Chen have mostly been looking at 'enterprise'
> workloads.
>
> > In my head Core Scheduling is already doing what we want. We just need to
> > extend it to be a bit more relaxed (best effort rather than completely strict
> > for security reasons today). This will be a lot more flexible and will allow
> > tasks to be co-located from the get-go. And it will defer the responsibility of
> > tagging to userspace. If they do better or worse, it's on them :) It seems you
> > already hit a corner case where the grouping was a bad idea and doing some
> > magic with thread numbers to alleviate it.
>
> No, Core scheduling does completely the wrong thing. Core scheduling is
> set up to do co-scheduling, because that's what was required for that
> whole speculation trainwreck. And that is very much not what you want or
> need here.
>
> You simply want a preference to co-locate things that use the same data.
> Which really is a completely different thing.
Hmm. Isn't the infra the same? We have a group of tasks tagged with a cookie
that needs to be co-located. Core scheduling is strict to keep them on the same
physical core, but the concept can be extended to co-locate on LLC or closest
cache?
>
> > FWIW I have come across cases on mobile world were co-locating on a cluster or
> > a 'big' core with big L2 cache can benefit a small group of tasks. So the
> > concept is generally beneficial as cache hierarchies are not symmetrical in
> > more systems now. Even on symmetrical systems, there can be cases made where
> > two small data dependent task can benefit from packing on a single CPU.
>
> Sure, we all know this. pipe-bench is a prime example, it flies if you
> co-locate them on the same CPU. It tanks if you pull them apart (except
> SMT siblings, those are mostly good too).
+1
>
> > I know this changes the direction being made here; but I strongly believe the
> > right way is to extend wake up path rather than lump it solely in LB (IIUC).
>
> You're really going to need both, and LB really is the more complicated
> part. On a busy/loaded system, LB will completely wreck things for you
> if it doesn't play ball.
Yes, I wasn't advocating for wakeup only, of course. I haven't read all
the details, but I saw no wakeup handling done.
And generally, as I think I have been indicating here and there, we do
need to unify the wakeup and LB decision trees. With push load balancing
this unification becomes a piece of cake if the wakeup path already
handles the case. The current LB is a big beast, and it will be slow to
react on many systems.
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 19:20 ` Tim Chen
@ 2026-02-19 21:04 ` Tim Chen
2026-02-20 17:17 ` Chen, Yu C
0 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-19 21:04 UTC (permalink / raw)
To: Chen, Yu C, Peter Zijlstra
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Thu, 2026-02-19 at 11:20 -0800, Tim Chen wrote:
> On Thu, 2026-02-19 at 23:20 +0800, Chen, Yu C wrote:
> > On 2/19/2026 10:59 PM, Peter Zijlstra wrote:
> > > On Tue, Feb 10, 2026 at 02:18:44PM -0800, Tim Chen wrote:
> > >
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > index cf643a5ddedd..ca46b5cf7f78 100644
> > > > --- a/kernel/sched/topology.c
> > > > +++ b/kernel/sched/topology.c
> > > > @@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
> > > > /* Protected by sched_domains_mutex: */
> > > > static cpumask_var_t sched_domains_tmpmask;
> > > > static cpumask_var_t sched_domains_tmpmask2;
> > > > +static int tl_max_llcs;
> > > >
> > > > static int __init sched_debug_setup(char *str)
> > > > {
> > > > @@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
> > > > */
> > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > > > DEFINE_PER_CPU(int, sd_llc_size);
> > > > -DEFINE_PER_CPU(int, sd_llc_id);
> > > > +DEFINE_PER_CPU(int, sd_llc_id) = -1;
> > > > DEFINE_PER_CPU(int, sd_share_id);
> > > > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > > > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > > > @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
> > > >
> > > > rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
> > > > per_cpu(sd_llc_size, cpu) = size;
> > > > - per_cpu(sd_llc_id, cpu) = id;
> > > > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> > > >
> > > > sd = lowest_flag_domain(cpu, SD_CLUSTER);
> > > > @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > > >
> > > > /* Set up domains for CPUs specified by the cpu_map: */
> > > > for_each_cpu(i, cpu_map) {
> > > > - struct sched_domain_topology_level *tl;
> > > > + struct sched_domain_topology_level *tl, *tl_llc = NULL;
> > > > + int lid;
> > > >
> > > > sd = NULL;
> > > > for_each_sd_topology(tl) {
> > > > + int flags = 0;
> > > > +
> > > > + if (tl->sd_flags)
> > > > + flags = (*tl->sd_flags)();
> > > > +
> > > > + if (flags & SD_SHARE_LLC)
> > > > + tl_llc = tl;
> > > >
> > > > sd = build_sched_domain(tl, cpu_map, attr, sd, i);
> > > >
> > > > @@ -2581,6 +2589,39 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > > > if (cpumask_equal(cpu_map, sched_domain_span(sd)))
> > > > break;
> > > > }
> > > > +
> > > > + lid = per_cpu(sd_llc_id, i);
> > > > + if (lid == -1) {
> > > > + int j;
> > > > +
> > > > + /*
> > > > + * Assign the llc_id to the CPUs that do not
> > > > + * have an LLC.
> > > > + */
> > >
> > > Where does this happen? Is this for things like Atom that don't have an
> > > L3 and so we don't set up a LLC domain?
> > >
> >
> > Yes, for some hybrid platforms, some CPUs on that platforms might not
> > have L3,
> > Tim might correct me if I’m wrong. Above code is derived from the
> > update_top_cache_domain(),
> > if there is no sched domain with SD_SHARE_LLC, per_cpu(sd_llc_id, cpu)
> > is set to the
> > CPU number directly.
> >
>
> That's correct. One example is Meteor Lake where some Atom CPUs don't have
> L3 but have only L2. And some Ampere CPUs also have no shared L3.
>
> https://www.spinics.net/lists/kernel/msg5863118.html
>
> This also reminded me that if we rely on cpu_coregroup_mask for LLC id
> assignment, we may be missing out such platforms which need to treat
> L2 as the last level cache. So we may need to fallback to cpu_clustergroup_mask
> or cpu_smt_mask where applicable.
On further inspection of the code, cpu_coregroup_mask will just be the same
as cpu_clustergroup_mask in that case, so we should be okay.
Tim
>
> Tim
>
> > thanks,
> > Chenyu
> >
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-19 16:50 ` Peter Zijlstra
@ 2026-02-19 21:06 ` Tim Chen
0 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-19 21:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Thu, 2026-02-19 at 17:50 +0100, Peter Zijlstra wrote:
> On Tue, Feb 10, 2026 at 02:18:55PM -0800, Tim Chen wrote:
>
> > +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
> > +{
> > + int smt_nr = 1;
> > +
> > +#ifdef CONFIG_SCHED_SMT
> > + if (sched_smt_active())
> > + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>
> cpu_smt_num_threads ?
Yes, cpu_smt_num_threads should work.
Tim
>
> > +#endif
> > +
> > + return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
> > + per_cpu(sd_llc_size, cpu));
> > +}
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-19 2:28 ` Madadi Vineeth Reddy
2026-02-19 14:38 ` Chen, Yu C
@ 2026-02-19 21:12 ` Tim Chen
1 sibling, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-19 21:12 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Thu, 2026-02-19 at 07:58 +0530, Madadi Vineeth Reddy wrote:
> On 19/02/26 03:14, Tim Chen wrote:
> > On Wed, 2026-02-18 at 23:24 +0530, Madadi Vineeth Reddy wrote:
> > > On 11/02/26 03:48, Tim Chen wrote:
> > > > From: Chen Yu <yu.c.chen@intel.com>
> > > >
> > > >
> > [ .. snip ..]
> >
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index d1145997b88d..86b6b08e7e1e 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
> > > > return valid_llc_id(id);
> > > > }
> > > >
> > > > +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
> > > > +{
> > > > + int smt_nr = 1;
> > > > +
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > + if (sched_smt_active())
> > > > + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
> > > > +#endif
> > > > +
> > > > + return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
> > > > + per_cpu(sd_llc_size, cpu));
> > >
> > >
> > > On Power10/Power11 with SMT4 and LLC size of 4, this check
> > > effectively disables cache-aware scheduling for any process.
> >
> > There are 4 cores per LLC, with 4 SMT per core? In that case, once we have more than
> > 4 running threads and there's another idle LLC available, seems
> > like putting the additional thread on a different LLC is the
> > right thing to do as threads sharing a core will usually be much
> > slower.
> >
> > But when number of threads are under 4, we should still be
> > doing aggregation.
> >
> > Perhaps I am misunderstanding your topology.
>
> There is only one core per LLC whose size is 4 CPUs.
> So, mm->sc_stat.nr_running_avg can't be >= 1 for
> cache aware scheduling to be enabled.
If there is only 1 core per LLC and mm->sc_stat.nr_running_avg > 1,
wouldn't it be better to spread the tasks among the cores with
normal load balancing, instead of having aggregated threads
fighting for the resources of a single core, i.e., running without
cache-aware scheduling?
Tim
>
> Thanks,
> Vineeth
>
> >
> > Tim
> >
> > >
> > > I raised this point in v1 as well. Increasing the threshold
> > > doesn't seem like a viable solution either, as that would regress
> > > hackbench/ebizzy.
> > >
> > > Is there a way to make this useful for architectures with small LLC
> > > sizes? One possible approach we were exploring is to have LLC at a
> > > hemisphere level that comprise multiple SMT4 cores.
> > >
> > > Thanks,
> > > Vineeth
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-19 19:48 ` Qais Yousef
@ 2026-02-19 21:47 ` Tim Chen
2026-02-20 3:41 ` Qais Yousef
0 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-19 21:47 UTC (permalink / raw)
To: Qais Yousef, Peter Zijlstra
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Libo Chen, linux-kernel
On Thu, 2026-02-19 at 19:48 +0000, Qais Yousef wrote:
> On 02/19/26 15:41, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > On 02/10/26 14:18, Tim Chen wrote:
> > > > This patch series introduces infrastructure for cache-aware load
> > > > balancing, with the goal of co-locating tasks that share data within
> > > > the same Last Level Cache (LLC) domain. By improving cache locality,
> > > > the scheduler can reduce cache bouncing and cache misses, ultimately
> > > > improving data access efficiency. The design builds on the initial
> > > > prototype from Peter [1].
> > > >
> > > > This initial implementation treats threads within the same process as
> > > > entities that are likely to share data. During load balancing, the
> > >
> > > This is a very aggressive assumption. From what I've seen, only few tasks truly
> > > share data. Lumping everything in a process together is an easy way to
> > > classify, but I think we can do better.
> >
> > Not without more information. And that is something we can always add
> > later. But like you well know, it is an uphill battle to get programs to
> > explain/annotate themselves.
>
> Yes. I think we should be able to come up with a daemon to profile a workload
> on a machine and come up with a recommendation of tasks that have data
> co-dependency.
>
> Note I strongly against programs specifying this themselves. We need to provide
> a service that helps with the correct tagging - ie: it is an admin only
> operation.
>
> >
> > The alternative is sampling things using the PMU, see which process is
> > trying to access which data, but that too is non-trivial, not to mention
> > it will get people really upset for consuming PMU resources.
>
> I was hoping we can tell which data structures are shared between tasks with
> perf?
>
> I am thinking this is not something that need to run continuously. But
> disocvered one time off on a machine or once every update. The profiling can be
> done once (on demand) I believe.
>
> Still if someone really wants to tag all the tasks for a process to stay
> together, I think this is fine if that's what they want.
I can envision that with tagging tasks with the same cookie that's analogous
to what we are doing for core scheduling. Or grouping tasks by tagging a
cgroup.
>
> >
> > Starting things with a simple assumption is fine. This can always be
> > extended. Gotta start somewhere and all that. It currently groups things
> > by mm_struct, but it would be fairly straight forward to allow userspace
> > to group tasks manually.
> >
> > > > scheduler attempts to aggregate such threads onto the same LLC domain
> > > > whenever possible.
> > >
> > > I admit yet to look fully at the series. But I must ask, why are you deferring
> > > to load balance and not looking at wake up path? LB should be for corrections.
> > > When wake up path is doing wrong decision all the time, LB (which is super slow
> > > to react) is too late to start grouping tasks? What am I missing?
> >
> > There used to be wakeup steering, but I'm not sure that still exists in
> > this version (still need to read beyond the first few patches). It isn't
> > hard to add.
> >
> > But I think Tim and Chen have mostly been looking at 'enterprise'
> > workloads.
> >
> > > In my head Core Scheduling is already doing what we want. We just need to
> > > extend it to be a bit more relaxed (best effort rather than completely strict
> > > for security reasons today). This will be a lot more flexible and will allow
> > > tasks to be co-located from the get-go. And it will defer the responsibility of
> > > tagging to userspace. If they do better or worse, it's on them :) It seems you
> > > already hit a corner case where the grouping was a bad idea and doing some
> > > magic with thread numbers to alleviate it.
> >
> > No, Core scheduling does completely the wrong thing. Core scheduling is
> > set up to do co-scheduling, because that's what was required for that
> > whole speculation trainwreck. And that is very much not what you want or
> > need here.
> >
> > You simply want a preference to co-locate things that use the same data.
> > Which really is a completely different thing.
>
> Hmm. Isn't the infra the same? We have a group of tasks tagged with a cookie
> that needs to be co-located. Core scheduling is strict to keep them on the same
> physical core, but the concept can be extended to co-locate on LLC or closest
> cache?
>
In my understanding, core scheduling doesn't try to place tasks
with the same cookie on the same core; it only ensures that tasks
sharing a cookie can safely be scheduled together on the SMT
siblings of a core.
However, we can certainly use a similar cookie mechanism to indicate
that tasks should be scheduled close to each other cache-wise.
> >
> > > FWIW I have come across cases on mobile world were co-locating on a cluster or
> > > a 'big' core with big L2 cache can benefit a small group of tasks. So the
> > > concept is generally beneficial as cache hierarchies are not symmetrical in
> > > more systems now. Even on symmetrical systems, there can be cases made where
> > > two small data dependent task can benefit from packing on a single CPU.
> >
> > Sure, we all know this. pipe-bench is a prime example, it flies if you
> > co-locate them on the same CPU. It tanks if you pull them apart (except
> > SMT siblings, those are mostly good too).
>
> +1
>
> >
> > > I know this changes the direction being made here; but I strongly believe the
> > > right way is to extend wake up path rather than lump it solely in LB (IIUC).
> >
> > You're really going to need both, and LB really is the more complicated
> > part. On a busy/loaded system, LB will completely wreck things for you
> > if it doesn't play ball.
>
> Yes I wasn't advocating for wake up both only of course. But I didn't read all
> the details but I saw no wake up done.
>
> And generally as I think I have been indicating here and there; we do need to
> unify the wakeup and LB decision tree. With push lb this unification become
> a piece of cake if the wakeup path already handles the case. The current LB
> is a big beast. And will be slow to react for many systems.
I think as long as we have up-to-date load information at the time of the
push in push load balancing, so we don't cause over-aggregation and too
much load imbalance, it will be viable to do such aggregation at wakeup.
Tim
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 6:12 ` K Prateek Nayak
2026-02-19 15:51 ` Peter Zijlstra
@ 2026-02-20 0:11 ` Tim Chen
1 sibling, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-20 0:11 UTC (permalink / raw)
To: K Prateek Nayak, Chen, Yu C
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar
On Thu, 2026-02-19 at 11:42 +0530, K Prateek Nayak wrote:
> Hello Tim,
>
> Thank you for the patch.
>
> On 2/19/2026 4:51 AM, Tim Chen wrote:
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 9848de949afa..4ddf54ab9cf7 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -987,6 +987,7 @@ config SCHED_CACHE
> > bool "Cache aware load balance"
> > default y
> > depends on SMP
> > + depends on SCHED_MC
> > help
> > When enabled, the scheduler will attempt to aggregate tasks from
> > the same process onto a single Last Level Cache (LLC) domain when
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 48626c81ba8e..75ba4e0bfcd3 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -8474,6 +8474,8 @@ int sched_cpu_deactivate(unsigned int cpu)
> > */
> > synchronize_rcu();
> >
> > + sched_domains_free_llc_id(cpu);
> > +
> > sched_set_rq_offline(rq, cpu);
> >
> > scx_rq_deactivate(rq);
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 6cbc56e9adfc..04f42526e6f0 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -3862,6 +3862,7 @@ static inline bool sched_cache_enabled(void)
> > extern void sched_cache_active_set_unlocked(void);
> > #endif
> > extern void init_sched_mm(struct task_struct *p);
> > +void sched_domains_free_llc_id(int cpu);
> >
> > extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
> > extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 580fb2fbc900..5e59340ad9a9 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -18,6 +18,7 @@ void sched_domains_mutex_unlock(void)
> > }
> >
> > /* Protected by sched_domains_mutex: */
> > +static cpumask_var_t sched_domains_llc_id_allocmask;
> > static cpumask_var_t sched_domains_tmpmask;
> > static cpumask_var_t sched_domains_tmpmask2;
> > static int tl_max_llcs;
> > @@ -2590,6 +2591,57 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> > return true;
> > }
> >
> > +static int __sched_domains_alloc_llc_id(void)
> > +{
> > + int lid;
> > +
> > + lockdep_assert_held(&sched_domains_mutex);
> > +
> > + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
> > + if (lid >= tl_max_llcs)
> > + tl_max_llcs = lid + 1;
> > +
> > + /*
> > + * llc_id space should never grow larger than the
> > + * possible number of CPUs in the system.
> > + */
> > + if (!unlikely(WARN_ON_ONCE(lid >= nr_cpumask_bits)))
> > + cpumask_set_cpu(lid, sched_domains_llc_id_allocmask);
> > + return lid;
> > +}
> > +
> > +static void __sched_domains_free_llc_id(int cpu)
> > +{
> > + int i, lid;
> > +
> > + lockdep_assert_held(&sched_domains_mutex);
> > +
> > + lid = per_cpu(sd_llc_id, cpu);
> > + if (lid == -1)
> > + return;
> > +
> > + per_cpu(sd_llc_id, cpu) = -1;
> > +
> > + for_each_online_cpu(i) {
> > + /* An online CPU owns the llc_id. */
> > + if (per_cpu(sd_llc_id, i) == lid)
> > + return;
> > + }
>
> We should perhaps warn and skip clearing lid from cpumask if lid was
> found to be larger than "nr_cpumask_bits". Shouldn't happen but just
> as a precaution.
Will do
>
> > +
> > + cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
> > +
> > + /* shrink max LLC size to save memory */
> > + if (lid == tl_max_llcs - 1)
> > + lid = tl_max_llcs--;
>
> No need to assign the local "lid" variable here; Simple decrement
> should do.
Good point
>
> > +}
> > +
> > +void sched_domains_free_llc_id(int cpu)
> > +{
> > + sched_domains_mutex_lock();
> > + __sched_domains_free_llc_id(cpu);
> > + sched_domains_mutex_unlock();
> > +}
> > +
> > /*
> > * Build sched domains for a given set of CPUs and attach the sched domains
> > * to the individual CPUs
> > @@ -2615,18 +2667,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >
> > /* Set up domains for CPUs specified by the cpu_map: */
> > for_each_cpu(i, cpu_map) {
> > - struct sched_domain_topology_level *tl, *tl_llc = NULL;
> > + struct sched_domain_topology_level *tl;
> > int lid;
> >
> > sd = NULL;
> > for_each_sd_topology(tl) {
> > - int flags = 0;
> > -
> > - if (tl->sd_flags)
> > - flags = (*tl->sd_flags)();
> > -
> > - if (flags & SD_SHARE_LLC)
> > - tl_llc = tl;
> >
> > sd = build_sched_domain(tl, cpu_map, attr, sd, i);
> >
> > @@ -2642,18 +2687,19 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > if (lid == -1) {
> > int j;
> >
> > + j = cpumask_first(cpu_coregroup_mask(i));
> > /*
> > * Assign the llc_id to the CPUs that do not
> > * have an LLC.
> > */
> > - if (!tl_llc) {
> > - per_cpu(sd_llc_id, i) = tl_max_llcs++;
> > + if (j >= nr_cpu_ids) {
> > + per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
> >
> > continue;
> > }
>
> I don't think we need to special case this out since:
>
> for_each_cpu(j, cpu_coregroup_mask(i)) {
> ...
> }
>
> would bail out if no CPU is set (also CPU "i" would definitely be
> set on it since it must be online) and the "if" after the loop will
> see "lid" as "-1" and DTRT.
That's right. Will take out the unneeded code.
Also found out that cpu_coregroup_mask() is not defined for configs
without CONFIG_SMP, so the llc id assignment code will go under
CONFIG_SMP.
Thanks for the code reviews and suggestions.
Tim
>
> >
> > /* try to reuse the llc_id of its siblings */
> > - for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
> > + for (; j < nr_cpu_ids; j = cpumask_next(j, cpu_coregroup_mask(i))) {
> > if (i == j)
> > continue;
> >
> > @@ -2668,7 +2714,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >
> > /* a new LLC is detected */
> > if (lid == -1)
> > - per_cpu(sd_llc_id, i) = tl_max_llcs++;
> > + per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
> > }
> > }
> >
> > @@ -2869,6 +2915,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
> > {
> > int err;
> >
> > + zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
> > zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
> > zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
> > zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
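The allocation/free scheme being reviewed above can be sketched in userspace C. This is a simplified model, not the kernel code: a plain bitmask stands in for sched_domains_llc_id_allocmask, and the shrink on free is the single decrement Prateek suggested (like the patch, it only shrinks by one even if lower ids are also free).

```c
#include <assert.h>

#define MAX_LLCS 64

static unsigned long alloc_mask;   /* stand-in for sched_domains_llc_id_allocmask */
static int tl_max_llcs;            /* high-water mark of allocated llc ids */

/* Return the first free id, growing the high-water mark if needed. */
static int alloc_llc_id(void)
{
	int lid = 0;

	while (lid < MAX_LLCS && (alloc_mask & (1UL << lid)))
		lid++;
	if (lid >= MAX_LLCS)
		return -1;		/* id space exhausted */
	if (lid >= tl_max_llcs)
		tl_max_llcs = lid + 1;
	alloc_mask |= 1UL << lid;
	return lid;
}

/* Free an id; shrink the high-water mark when the top id is released. */
static void free_llc_id(int lid)
{
	if (lid < 0 || lid >= MAX_LLCS)
		return;
	alloc_mask &= ~(1UL << lid);
	if (lid == tl_max_llcs - 1)
		tl_max_llcs--;		/* plain decrement, per the review comment */
}
```

Freed ids below the high-water mark are reused by the first-zero-bit search on the next allocation, which is what keeps the id space bounded by the number of online LLCs.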
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-19 15:07 ` Chen, Yu C
2026-02-19 18:11 ` Tim Chen
@ 2026-02-20 3:25 ` Qais Yousef
2026-02-21 2:48 ` Chen, Yu C
1 sibling, 1 reply; 117+ messages in thread
From: Qais Yousef @ 2026-02-20 3:25 UTC (permalink / raw)
To: Chen, Yu C
Cc: Peter Zijlstra, Tim Chen, Ingo Molnar, K Prateek Nayak,
Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
Josh Don, Gavin Guo, Libo Chen, linux-kernel
On 02/19/26 23:07, Chen, Yu C wrote:
> Hi Peter, Qais,
>
> On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > On 02/10/26 14:18, Tim Chen wrote:
>
> [ ... ]
>
> > >
> > > I admit yet to look fully at the series. But I must ask, why are you deferring
> > > to load balance and not looking at wake up path? LB should be for corrections.
> > > When wake up path is doing wrong decision all the time, LB (which is super slow
> > > to react) is too late to start grouping tasks? What am I missing?
> >
> > There used to be wakeup steering, but I'm not sure that still exists in
> > this version (still need to read beyond the first few patches). It isn't
> > hard to add.
> >
>
> Please let me explain a little more about why we did this in the
> load balance path. Yes, the original version implemented cache-aware
> scheduling only in the wakeup path. According to our testing, this appeared
> to cause some task bouncing issues across LLCs. This was due to conflicts
> with the legacy load balancer, which tries to spread tasks to different
> LLCs.
> So as Peter said, the load balancer should be taken care of anyway. Later,
> we kept only the cache aware logic in the load balancer, and the test
> results
Yes, we need both. My concern is that the original intent is for the wake up
path to keep tasks placed correctly, as most tasks wake up and sleep often and
this is the common case. If the decision tree is not unified, we will have
problems. And this is not a problem specific to doing placement based on
memory dependency. We need to extend the wake up path to do placement based on
latency. Placement based on energy (EAS) has the same problem too. It disabled
LB altogether, which is a problem we are trying to fix, if you saw the other
discussion about overutilized handling. The load balancer can destroy energy
balance easily and it has no notion of how to distribute based on energy. This
is a recurring theme for any new task placement decision that is not purely
based on load. The LB will wreak havoc.
> became much more stable, so we kept it as is. The wakeup path more or less
> aggregates the wakees(threads within the same process) within the LLC in the
> wakeup fast path, so we have not changed it for now.
How expensive is it to use the new push lb, which unifies the decision with
the wake up path, to detect these bad task placements and steer them back to
the right LLC? I think if we can construct the trigger right, we can make it
much easier for the load balancer to keep tagged tasks within the same LLC. In
my view this bad task placement is just a new type of misfit, where a task has
strayed from its group for whatever reason at wake up and is not sleeping and
waking up again to be placed back with its clan - assuming the conditions have
changed to warrant the move - which the wake up path should handle anyway.
FWIW, I have been experimenting with using push lb to keep regular LB off and
relying solely on it to manage the important corner cases (including the
overloaded one) - and seeing *very* promising results. But the systems I work
with are small compared to yours.
But essentially, if we can construct the system so that the wakeup path (via
the regular sleep/wakeup cycle and push lb) keeps the system relatively
balanced, and delay regular LB for when we need to do a large intervention, we
can simplify the problem space significantly IMHO. If the LB had to kick in,
then the delays of not finding enough bandwidth to run are larger than the
delays of not sharing the hottest LLC. IOW, keep the regular LB as-is for true
load balance and handle the small exceptions via the natural sleep/wakeup
cycle or push lb.
>
> Let me copy the changelog from the previous patch version:
>
> "
> In previous versions, aggregation of tasks were done in the
> wake up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
>
> 1) Aggregation of tasks during wake up led to load imbalance
> between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake up tasks aggregation happened at a faster rate and
> load balancing moved tasks in opposite directions, leading
> to continuous and excessive task migrations and regressions
> in benchmarks like schbench.
Note this is an artefact of tagging all tasks belonging to the process as
co-dependent. So somehow this is a case of shooting oneself in the foot,
because processes with large numbers of tasks will create large imbalances and
will start to require special handling. I guess the question is: were they
really that packed, which means the steering logic needed to relax a little
bit and say hey, this is an overcommit, I must spill to the other LLCs; or was
it really okay to pack them all in one LLC, and LB was overzealous to kick in
and needed to be aware that the new case is not really a problem that requires
its intervention?
>
> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
I think this might work under the conditions you care about, but it will be
hard to generalize. I might need to go and read more.
Note I am mainly concerned because the wake up path can't stay based purely on
load forever and needs to be able to make smarter decisions (latency being the
most important one on the horizon). And they will all hit this problem. I think
we need to find a good recipe for how to handle these problems in general.
I don't think we can extend the LB to be energy aware, latency aware, cache
aware etc. without hitting a lot of hurdles. And it is too slow to react.
>
> 1) Identify tasks that prefer to run on their hottest LLC and
> move them there.
> 2) Prevent generic load balancing from moving a task out of
> its hottest LLC.
Isn't this 2nd part the fix to the wake up problem you faced? 1 should
naturally be happening at wake up. And for random long running strayed tasks,
I believe push lb is an easier way to manage them.
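For illustration, the two quoted rules of cache-aware load balancing might be modeled as predicates like the following. This is a hedged sketch, not the series' actual code: the struct and helper names are hypothetical, and the "overloaded LLC" escape hatch the real series needs is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified task model for the two rules quoted above. */
struct task {
	int llc;		/* LLC the task currently runs in */
	int preferred_llc;	/* hottest LLC for its process, -1 if none */
};

/* Rule 1: a task off its preferred LLC should be pulled toward it. */
static bool should_pull_to_pref(const struct task *p, int dst_llc)
{
	return p->preferred_llc >= 0 &&
	       dst_llc == p->preferred_llc &&
	       p->llc != p->preferred_llc;
}

/*
 * Rule 2: generic load balancing should not move a task that is already
 * on its preferred LLC to some other LLC (overload handling omitted).
 */
static bool can_migrate_generic(const struct task *p, int dst_llc)
{
	if (p->preferred_llc >= 0 &&
	    p->llc == p->preferred_llc &&
	    dst_llc != p->preferred_llc)
		return false;
	return true;
}
```

Rule 1 is the "corrective" pull Qais argues should mostly happen at wake up or via push lb; rule 2 is the filter that stops the regular balancer from undoing the aggregation.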
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-19 18:11 ` Tim Chen
@ 2026-02-20 3:29 ` Qais Yousef
2026-02-20 9:43 ` Peter Zijlstra
2026-02-20 18:14 ` Tim Chen
0 siblings, 2 replies; 117+ messages in thread
From: Qais Yousef @ 2026-02-20 3:29 UTC (permalink / raw)
To: Tim Chen
Cc: Chen, Yu C, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
Josh Don, Gavin Guo, Libo Chen, linux-kernel
On 02/19/26 10:11, Tim Chen wrote:
> On Thu, 2026-02-19 at 23:07 +0800, Chen, Yu C wrote:
> > Hi Peter, Qais,
> >
> > On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> > > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > > On 02/10/26 14:18, Tim Chen wrote:
> >
> > [ ... ]
> >
> > > >
> > > > I admit yet to look fully at the series. But I must ask, why are you deferring
> > > > to load balance and not looking at wake up path? LB should be for corrections.
> > > > When wake up path is doing wrong decision all the time, LB (which is super slow
> > > > to react) is too late to start grouping tasks? What am I missing?
> > >
> > > There used to be wakeup steering, but I'm not sure that still exists in
> > > this version (still need to read beyond the first few patches). It isn't
> > > hard to add.
> > >
> >
> > Please let me explain a little more about why we did this in the
> > load balance path. Yes, the original version implemented cache-aware
> > scheduling only in the wakeup path. According to our testing, this appeared
> > to cause some task bouncing issues across LLCs. This was due to conflicts
> > with the legacy load balancer, which tries to spread tasks to different
> > LLCs.
> > So as Peter said, the load balancer should be taken care of anyway. Later,
> > we kept only the cache aware logic in the load balancer, and the test
> > results
> > became much more stable, so we kept it as is. The wakeup path more or less
> > aggregates the wakees(threads within the same process) within the LLC in
> > the
> > wakeup fast path, so we have not changed it for now.
> >
> > Let me copy the changelog from the previous patch version:
> >
> > "
> > In previous versions, aggregation of tasks were done in the
> > wake up path, without making load balancing paths aware of
> > LLC (Last-Level-Cache) preference. This led to the following
> > problems:
> >
> > 1) Aggregation of tasks during wake up led to load imbalance
> > between LLCs
> > 2) Load balancing tried to even out the load between LLCs
> > 3) Wake up tasks aggregation happened at a faster rate and
> > load balancing moved tasks in opposite directions, leading
> > to continuous and excessive task migrations and regressions
> > in benchmarks like schbench.
> >
> > In this version, load balancing is made cache-aware. The main
> > idea of cache-aware load balancing consists of two parts:
> >
> > 1) Identify tasks that prefer to run on their hottest LLC and
> > move them there.
> > 2) Prevent generic load balancing from moving a task out of
> > its hottest LLC.
> > "
> >
>
> Another reason why we moved away from doing things in the wake up
> path is load imbalance consideration. Wake up path does not have
> the most up to date load information in the LLC sched domains as
> in the load balance path. So you may actually have everyone rush
What's the reason wake up doesn't have the latest info? Is this a limitation of
these large systems where stats updates are too expensive to do? Is it not
fixable at all?
> into their own favorite LLC and cause LLC overload. And load balance
> will have to undo this. This led to frequent task migrations that
> hurt performance.
>
> It is better to consider LLC preference in the load balance path
> so we can aggregate tasks while still keeping load imbalance under
> control.
>
> Tim
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-19 21:47 ` Tim Chen
@ 2026-02-20 3:41 ` Qais Yousef
2026-02-20 8:45 ` Peter Zijlstra
0 siblings, 1 reply; 117+ messages in thread
From: Qais Yousef @ 2026-02-20 3:41 UTC (permalink / raw)
To: Tim Chen
Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Libo Chen, linux-kernel
On 02/19/26 13:47, Tim Chen wrote:
> > > > I know this changes the direction being made here; but I strongly believe the
> > > > right way is to extend wake up path rather than lump it solely in LB (IIUC).
> > >
> > > You're really going to need both, and LB really is the more complicated
> > > part. On a busy/loaded system, LB will completely wreck things for you
> > > if it doesn't play ball.
> >
> > Yes I wasn't advocating for wake up both only of course. But I didn't read all
> > the details but I saw no wake up done.
> >
> > And generally as I think I have been indicating here and there; we do need to
> > unify the wakeup and LB decision tree. With push lb this unification become
> > a piece of cake if the wakeup path already handles the case. The current LB
> > is a big beast. And will be slow to react for many systems.
>
> I think as long as we have up to date information on load at the time of push
> in push lb, so we don't cause over aggregation and too much load imbalance,
> it will be viable to make such aggregation at wake up.
IMHO I see people constantly tripping over task placement being too simple
and needing a smarter decision-making process. I think Vincent's proposal is
spot on to help us handle all these situations simply, with the added bonus of
it being a lot more reactive. Going down this rabbit hole is worthwhile and
will benefit us in the long run to handle more cases.
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-19 16:55 ` Peter Zijlstra
@ 2026-02-20 6:40 ` Madadi Vineeth Reddy
2026-02-20 9:53 ` Peter Zijlstra
0 siblings, 1 reply; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-20 6:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Madadi Vineeth Reddy
Hi Peter,
On 19/02/26 22:25, Peter Zijlstra wrote:
> On Wed, Feb 18, 2026 at 11:24:05PM +0530, Madadi Vineeth Reddy wrote:
>> Is there a way to make this useful for architectures with small LLC
>> sizes? One possible approach we were exploring is to have LLC at a
>> hemisphere level that comprise multiple SMT4 cores.
>
> Is this hemisphere an actual physical cache level, or would that be
> artificial?
It's artificial. There is no cache being shared at this level, but this is
still the level where some amount of cache snooping takes place, and it is
relatively faster to access data from the caches of the cores within this
domain.
We verified this with a producer-consumer workload, where placing the
producer and consumer threads in the same hemisphere showed measurably
better latency compared to cross-hemisphere placement.
Thanks,
Vineeth
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-19 16:52 ` Peter Zijlstra
@ 2026-02-20 7:02 ` Madadi Vineeth Reddy
0 siblings, 0 replies; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-20 7:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Madadi Vineeth Reddy
On 19/02/26 22:22, Peter Zijlstra wrote:
> On Wed, Feb 18, 2026 at 11:24:05PM +0530, Madadi Vineeth Reddy wrote:
>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d1145997b88d..86b6b08e7e1e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1223,6 +1223,19 @@ static inline bool valid_llc_buf(struct sched_domain *sd,
>>> return valid_llc_id(id);
>>> }
>>>
>>> +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>>> +{
>>> + int smt_nr = 1;
>>> +
>>> +#ifdef CONFIG_SCHED_SMT
>>> + if (sched_smt_active())
>>> + smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>>> +#endif
>>> +
>>> + return !fits_capacity((mm->sc_stat.nr_running_avg * smt_nr),
>>> + per_cpu(sd_llc_size, cpu));
>>
>>
>> On Power10/Power11 with SMT4 and LLC size of 4, this check
>> effectively disables cache-aware scheduling for any process.
>>
>> I raised this point in v1 as well. Increasing the threshold
>> doesn't seem like a viable solution either, as that would regress
>> hackbench/ebizzy.
>>
>> Is there a way to make this useful for architectures with small LLC
>> sizes? One possible approach we were exploring is to have LLC at a
>> hemisphere level that comprise multiple SMT4 cores.
>
> One way forward would be to use a llc-mask instead of a single llc value
> for preference. I think this got mentioned before, and I think it makes
> sense to do this later.
>
> But once you can have a 'few' LLCs as preference, this constraint
> becomes a little easier.
Yes, that makes sense. Spanning the llc-mask across multiple cores
in a hemisphere for preference would relax this condition.
We will explore how this can be incorporated. Thanks for taking a
look.
Thanks,
Vineeth
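For reference, the quoted exceed_llc_nr() check reduces to roughly the following model, which shows why it always fires on an SMT4 machine with a 4-CPU LLC. fits_capacity() is shown with the mainline ~20% headroom margin (1280/1024); treat the exact margin and the simplified signature as assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* fits_capacity() with the kernel's ~20% headroom margin. */
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

/*
 * Simplified model of the quoted check: a process whose average number
 * of runnable threads, scaled by SMT siblings per core, does not fit an
 * LLC of llc_size CPUs gets cache-aware scheduling disabled.
 */
static bool exceed_llc_nr(unsigned long nr_running_avg, int smt_nr,
			  int llc_size)
{
	return !fits_capacity(nr_running_avg * smt_nr, llc_size);
}
```

With llc_size == 4 and smt_nr == 4 (Power10/Power11 SMT4), even a single runnable thread gives 4 * 1280 = 5120 >= 4 * 1024, so the check trips unconditionally; an llc-mask spanning a hemisphere would raise the effective llc_size and relax this.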
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-20 3:41 ` Qais Yousef
@ 2026-02-20 8:45 ` Peter Zijlstra
2026-02-24 3:31 ` Qais Yousef
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 8:45 UTC (permalink / raw)
To: Qais Yousef
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Libo Chen, linux-kernel
On Fri, Feb 20, 2026 at 03:41:27AM +0000, Qais Yousef wrote:
> IMHO I see people are constantly tripping over task placement being too simple
> and need smarter decision making process.
So at the same time we're always having trouble because it's too
expensive for some.
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-20 3:29 ` Qais Yousef
@ 2026-02-20 9:43 ` Peter Zijlstra
2026-02-24 2:49 ` Qais Yousef
2026-02-20 18:14 ` Tim Chen
1 sibling, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 9:43 UTC (permalink / raw)
To: Qais Yousef
Cc: Tim Chen, Chen, Yu C, Ingo Molnar, K Prateek Nayak,
Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
Josh Don, Gavin Guo, Libo Chen, linux-kernel
On Fri, Feb 20, 2026 at 03:29:41AM +0000, Qais Yousef wrote:
> What's the reason wake up doesn't have the latest info? Is this a limitation of
> these large systems where stats updates are too expensive to do? Is it not
> fixable at all?
Scalability is indeed the main problem. The periodic load-balancer, by
virtue of being 'slow' has two advantages:
- the cost of aggregating the numbers is amortized by the relative low
frequency of aggregation
- it can work with averages; it is less concerned with immediate
spikes.
This obviously has the exact inverse set of problems in that it is not
able to deal with immediate/short term issues.
Anyway, we're already at the point where EAS wakeup path is getting far
too expensive for the current set of hardware. While we started with a
handful of asymmetric CPUs, we're now pushing 32 CPUs or so.
(Look at Intel Nova Lake speculation online, that's supposedly going to
get us 2 dies of 8P+16E with another 4 bonus weaklings on the south
bridge or something, for a grand total of 52 asymmetric CPUs of 3 kinds)
Then consider:
- Intel Granite Rapids-SP at 8*86 cores for 688 cores / 1376 threads.
- AMD Prometheus at 2*192 cores with 384 cores / 768 threads.
- Power10, it is something like 16 sockets, 16 cores per socket, 8
threads per core, for a mere 2048 threads.
These are silly numbers of CPUs.
Now, these are the extreme end of the spectrum systems, 'nobody' will
actually have them, but in a few generations they'll seem small again.
So whatever we build now, will have to deal with silly numbers of CPUs.
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-20 6:40 ` Madadi Vineeth Reddy
@ 2026-02-20 9:53 ` Peter Zijlstra
2026-02-24 9:42 ` Madadi Vineeth Reddy
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 9:53 UTC (permalink / raw)
To: Madadi Vineeth Reddy
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Fri, Feb 20, 2026 at 12:10:21PM +0530, Madadi Vineeth Reddy wrote:
> Hi Peter,
>
> On 19/02/26 22:25, Peter Zijlstra wrote:
> > On Wed, Feb 18, 2026 at 11:24:05PM +0530, Madadi Vineeth Reddy wrote:
> >> Is there a way to make this useful for architectures with small LLC
> >> sizes? One possible approach we were exploring is to have LLC at a
> >> hemisphere level that comprise multiple SMT4 cores.
> >
> > Is this hemisphere an actual physical cache level, or would that be
> > artificial?
>
> It's artificial. There is no cache being shared at this level but this is
> still the level where some amount of cache-snooping takes place and it is
> relatively faster to access the data from the caches of the cores
> within this domain.
>
> We verified with this producer consumer workload where the producer
> and consumer threads placed in the same hemisphere showed measurably
> better latency compared to cross-hemisphere placement.
So I just read the Power10 Wikipedia entry; that seems to suggest there
actually is a significant L3 at the hemisphere level.
That thing states that Power10 has:
- 16 cores in two hemispheres of 8 cores each.
- each core has 2M L2 cache
- each hemi has 64M of L3 cache
Then there appears to be a 'funny' in that there's always one 'dead'
core, so you end up with 8+7, and the small hemi loses an 8M L3 slice
due to that.
Now, I'm just reading a Wiki page written by a random person on the
interweb, so perhaps this is wrong (in which case I would suggest you
get someone from IBM to go and edit that page and provide references),
or there has been a miscommunication somewhere else, and perhaps there
really is L3 at the hemi level, and arch/powerpc/ 'forgot' to expose
that :-)
* Re: [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter
2026-02-10 22:18 ` [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter Tim Chen
@ 2026-02-20 10:45 ` Peter Zijlstra
2026-02-20 16:57 ` Chen, Yu C
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 10:45 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
On Tue, Feb 10, 2026 at 02:18:47PM -0800, Tim Chen wrote:
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index a4e2fb31f2fd..3aa6c101b2e4 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -102,6 +102,10 @@ struct sched_domain {
> u64 max_newidle_lb_cost;
> unsigned long last_decay_max_lb_cost;
>
> +#ifdef CONFIG_SCHED_CACHE
> + unsigned int *pf;
So I'm all for short names; but perhaps this could be better. When
reading this my brain went page-fault, and then WTF :-)
> +#endif
> +
> #ifdef CONFIG_SCHEDSTATS
> /* sched_balance_rq() stats */
> unsigned int lb_count[CPU_MAX_IDLE_TYPES];
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ca46b5cf7f78..dae78b5915a7 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2723,6 +2778,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> }
> rcu_read_unlock();
>
> + /*
> + * Ensure we see enlarged sd->pf when we use new llc_ids and
> + * bigger max_llcs.
> + */
> + smp_mb();
> + max_llcs = tl_max_llcs;
This seems wrong. This is *after* cpu_attach_domain() which publishes
@sd. How about you do something like:
struct sched_domain {
...
unsigned int llc_max;
unsigned int *llc_counts __counted_by(llc_max);
}
Then you always carry matching information that is published together.
> if (has_asym)
> static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
>
> --
> 2.32.0
>
* Re: [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference
2026-02-10 22:18 ` [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference Tim Chen
@ 2026-02-20 11:02 ` Peter Zijlstra
2026-02-20 14:02 ` Peter Zijlstra
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 11:02 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
On Tue, Feb 10, 2026 at 02:18:48PM -0800, Tim Chen wrote:
> static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
> {
> + struct sched_domain *sd;
> int pref_llc;
>
> pref_llc = p->preferred_llc;
> - if (pref_llc < 0)
> + if (!valid_llc_id(pref_llc))
> return;
>
> rq->nr_llc_running++;
> rq->nr_pref_llc_running += (pref_llc == task_llc(p));
> +
> + scoped_guard (rcu) {
> + sd = rcu_dereference(rq->sd);
> + if (valid_llc_buf(sd, pref_llc))
> + sd->pf[pref_llc]++;
> + }
> }
>
> static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
> {
> + struct sched_domain *sd;
> int pref_llc;
>
> pref_llc = p->preferred_llc;
> - if (pref_llc < 0)
> + if (!valid_llc_id(pref_llc))
> return;
>
> rq->nr_llc_running--;
> rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
> +
> + scoped_guard (rcu) {
> + sd = rcu_dereference(rq->sd);
> + if (valid_llc_buf(sd, pref_llc)) {
> + /*
> + * There is a race condition between dequeue
> + * and CPU hotplug. After a task has been enqueued
> + * on CPUx, a CPU hotplug event occurs, and all online
> + * CPUs (including CPUx) rebuild their sched_domains
> + * and reset statistics to zero (including sd->pf).
> + * This can cause temporary undercount and we have to
> + * check for such underflow in sd->pf.
> + *
> + * This undercount is temporary and accurate accounting
> + * will resume once the rq has a chance to be idle.
> + */
> + if (sd->pf[pref_llc])
> + sd->pf[pref_llc]--;
> + }
> + }
> }
FWIW, enqueue/dequeue must be with rq->lock held, and thus preemption
disabled and IRQs off. That RCU section is completely pointless.
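The enqueue/dequeue accounting with the hotplug-underflow guard quoted above boils down to something like this userspace sketch. It is a simplified model: a fixed array stands in for sd->pf[], the bounds check stands in for valid_llc_id()/valid_llc_buf(), and (per Peter's point) no locking or RCU is modeled since the real callers hold rq->lock.

```c
#include <assert.h>

#define NR_LLCS 4

/* Stand-in for sd->pf[]: per-LLC count of runnable tasks preferring it. */
static unsigned int pf[NR_LLCS];

static void account_llc_enqueue(int pref_llc)
{
	if (pref_llc < 0 || pref_llc >= NR_LLCS)	/* invalid llc id */
		return;
	pf[pref_llc]++;
}

static void account_llc_dequeue(int pref_llc)
{
	if (pref_llc < 0 || pref_llc >= NR_LLCS)
		return;
	/*
	 * Guard against underflow: a hotplug-triggered sched_domain
	 * rebuild may have reset the counters to zero between a task's
	 * enqueue and its dequeue. The undercount is temporary and
	 * accurate accounting resumes once the rq goes idle.
	 */
	if (pf[pref_llc])
		pf[pref_llc]--;
}
```

The guard trades a temporarily low count for never wrapping the unsigned counter to a huge value, which would otherwise poison the load-balance statistics until the next rebuild.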
* Re: [PATCH v3 09/21] sched/cache: Count tasks prefering destination LLC in a sched group
2026-02-10 22:18 ` [PATCH v3 09/21] sched/cache: Count tasks prefering destination LLC in a sched group Tim Chen
@ 2026-02-20 12:52 ` Peter Zijlstra
2026-02-20 13:43 ` Peter Zijlstra
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 12:52 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
On Tue, Feb 10, 2026 at 02:18:49PM -0800, Tim Chen wrote:
> @@ -11034,6 +11037,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> {
> int i, nr_running, local_group, sd_flags = env->sd->flags;
> bool balancing_at_rd = !env->sd->parent;
> +#ifdef CONFIG_SCHED_CACHE
> + int dst_llc = llc_id(env->dst_cpu);
> +#endif
>
> memset(sgs, 0, sizeof(*sgs));
>
> @@ -11054,6 +11060,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> if (cpu_overutilized(i))
> *sg_overutilized = 1;
>
> +#ifdef CONFIG_SCHED_CACHE
> + if (sched_cache_enabled() && llc_id(i) != dst_llc) {
If you write that like:
if (sched_cache_enabled && llc_id(i) != llc_id(env->dst_cpu))
You can get rid of that dst_llc variable, but more importantly its
ifdeffery.
> + struct sched_domain *sd_tmp = rcu_dereference(rq->sd);
> +
> + if (valid_llc_buf(sd_tmp, dst_llc))
> + sgs->nr_pref_dst_llc += sd_tmp->pf[dst_llc];
> + }
> +#endif
* Re: [PATCH v3 09/21] sched/cache: Count tasks prefering destination LLC in a sched group
2026-02-20 12:52 ` Peter Zijlstra
@ 2026-02-20 13:43 ` Peter Zijlstra
2026-02-21 2:53 ` Chen, Yu C
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 13:43 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
On Fri, Feb 20, 2026 at 01:52:48PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 10, 2026 at 02:18:49PM -0800, Tim Chen wrote:
>
> > @@ -11034,6 +11037,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > {
> > int i, nr_running, local_group, sd_flags = env->sd->flags;
> > bool balancing_at_rd = !env->sd->parent;
> > +#ifdef CONFIG_SCHED_CACHE
> > + int dst_llc = llc_id(env->dst_cpu);
> > +#endif
> >
> > memset(sgs, 0, sizeof(*sgs));
> >
> > @@ -11054,6 +11060,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> > if (cpu_overutilized(i))
> > *sg_overutilized = 1;
> >
> > +#ifdef CONFIG_SCHED_CACHE
> > + if (sched_cache_enabled() && llc_id(i) != dst_llc) {
>
> If you write that like:
>
> if (sched_cache_enabled && llc_id(i) != llc_id(env->dst_cpu))
>
> You can get rid of that dst_llc variable, but more importantly its
> ifdeffery.
Ah, you're perhaps wanting to not re-load on the dst_llc usage below? Do
the compilers DTRT when you mark llc_id() as __pure?
> > + struct sched_domain *sd_tmp = rcu_dereference(rq->sd);
> > +
> > + if (valid_llc_buf(sd_tmp, dst_llc))
> > + sgs->nr_pref_dst_llc += sd_tmp->pf[dst_llc];
> > + }
> > +#endif
>
>
* Re: [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC
2026-02-10 22:18 ` [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
2026-02-17 19:00 ` Madadi Vineeth Reddy
@ 2026-02-20 13:53 ` Peter Zijlstra
2026-02-20 18:22 ` Tim Chen
1 sibling, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 13:53 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
On Tue, Feb 10, 2026 at 02:18:53PM -0800, Tim Chen wrote:
> In the generic load balance (non-cache-aware load balance),
> if the busiest runqueue has only one task, active balancing may be
> invoked to move it. However, this migration might break LLC locality.
I'm thinking this wants more explanation; yes it might break locality,
but why is that bad?
This way you're inhibiting a full LLC from being able to power down.
> +/*
> + * Check if active load balance breaks LLC locality in
> + * terms of cache aware load balance.
> + */
> +static inline bool
> +alb_break_llc(struct lb_env *env)
> +{
> + if (!sched_cache_enabled())
> + return false;
> +
> + if (cpus_share_cache(env->src_cpu, env->dst_cpu))
> + return false;
> + /*
> + * All tasks prefer to stay on their current CPU.
> + * Do not pull a task from its preferred CPU if:
> + * 1. It is the only task running there; OR
> + * 2. Migrating it away from its preferred LLC would violate
> + * the cache-aware scheduling policy.
> + */
> + if (env->src_rq->nr_pref_llc_running &&
> + env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
> + unsigned long util = 0;
> + struct task_struct *cur;
> +
> + if (env->src_rq->nr_running <= 1)
> + return true;
> +
> + /*
> + * Reach here in load balance with
> + * rcu_read_lock() protected.
> + */
Not sure that comment helps much. rcu_dereference() itself will already
cause complaints if the constraints are violated.
> + cur = rcu_dereference(env->src_rq->curr);
> + if (cur)
> + util = task_util(cur);
> +
> + if (can_migrate_llc(env->src_cpu, env->dst_cpu,
> + util, false) == mig_forbid)
> + return true;
> + }
> +
> + return false;
> +}
* Re: [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference
2026-02-20 11:02 ` Peter Zijlstra
@ 2026-02-20 14:02 ` Peter Zijlstra
2026-02-20 17:25 ` Chen, Yu C
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 14:02 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
On Fri, Feb 20, 2026 at 12:02:22PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 10, 2026 at 02:18:48PM -0800, Tim Chen wrote:
> > static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
> > {
> > + struct sched_domain *sd;
> > int pref_llc;
> >
> > pref_llc = p->preferred_llc;
> > - if (pref_llc < 0)
> > + if (!valid_llc_id(pref_llc))
> > return;
> >
> > rq->nr_llc_running++;
> > rq->nr_pref_llc_running += (pref_llc == task_llc(p));
> > +
> > + scoped_guard (rcu) {
> > + sd = rcu_dereference(rq->sd);
> > + if (valid_llc_buf(sd, pref_llc))
> > + sd->pf[pref_llc]++;
> > + }
> > }
> >
> > static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
> > {
> > + struct sched_domain *sd;
> > int pref_llc;
> >
> > pref_llc = p->preferred_llc;
> > - if (pref_llc < 0)
> > + if (!valid_llc_id(pref_llc))
> > return;
> >
> > rq->nr_llc_running--;
> > rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
> > +
> > + scoped_guard (rcu) {
> > + sd = rcu_dereference(rq->sd);
> > + if (valid_llc_buf(sd, pref_llc)) {
> > + /*
> > + * There is a race condition between dequeue
> > + * and CPU hotplug. After a task has been enqueued
> > + * on CPUx, a CPU hotplug event occurs, and all online
> > + * CPUs (including CPUx) rebuild their sched_domains
> > + * and reset statistics to zero (including sd->pf).
> > + * This can cause temporary undercount and we have to
> > + * check for such underflow in sd->pf.
> > + *
> > + * This undercount is temporary and accurate accounting
> > + * will resume once the rq has a chance to be idle.
> > + */
> > + if (sd->pf[pref_llc])
> > + sd->pf[pref_llc]--;
> > + }
> > + }
> > }
>
> FWIW, enqueue/dequeue must be with rq->lock held, and thus preemption
> disabled and IRQs off. That RCU section is completely pointless.
That is, use rcu_dereference_all() and observe the warning go away.
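A minimal user-space model of the underflow guard under discussion (function and array names are illustrative stand-ins for the quoted kernel code, not the posted implementation):

```c
#include <assert.h>

#define MAX_LLCS 8

/* Stand-in for sd->pf[]: pf[llc] counts enqueued tasks whose
 * preferred LLC is llc. */
static unsigned int pf[MAX_LLCS];

static void account_llc_enqueue_model(int pref_llc)
{
	if (pref_llc < 0 || pref_llc >= MAX_LLCS)	/* valid_llc_buf() analogue */
		return;
	pf[pref_llc]++;
}

static void account_llc_dequeue_model(int pref_llc)
{
	if (pref_llc < 0 || pref_llc >= MAX_LLCS)
		return;
	/* A CPU hotplug rebuild can reset pf[] to zero while tasks are
	 * still enqueued, so dequeue must guard against underflow
	 * instead of blindly decrementing. */
	if (pf[pref_llc])
		pf[pref_llc]--;
}
```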
* Re: [PATCH v3 19/21] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
2026-02-10 22:18 ` [PATCH v3 19/21] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling Tim Chen
@ 2026-02-20 14:29 ` Peter Zijlstra
2026-02-20 18:18 ` Tim Chen
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 14:29 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Tue, Feb 10, 2026 at 02:18:59PM -0800, Tim Chen wrote:
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index bae747eddc59..dc4b7de6569f 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -566,6 +566,16 @@ static __init int sched_init_debug(void)
> #ifdef CONFIG_SCHED_CACHE
> debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
> &sched_cache_enable_fops);
> + debugfs_create_u32("llc_aggr_tolerance", 0644, debugfs_sched,
> + &llc_aggr_tolerance);
> + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
> + &llc_epoch_period);
> + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
> + &llc_epoch_affinity_timeout);
> + debugfs_create_u32("llc_overaggr_pct", 0644, debugfs_sched,
> + &llc_overaggr_pct);
> + debugfs_create_u32("llc_imb_pct", 0644, debugfs_sched,
> + &llc_imb_pct);
> #endif
So we have debug/sched/numa_balancing/, would it make sense to stick all
of this into debug/sched/llc_balancing/ or so?
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 15:48 ` Peter Zijlstra
@ 2026-02-20 15:22 ` Chen, Yu C
0 siblings, 0 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-20 15:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Tim Chen,
Gautham R . Shenoy, Vincent Guittot, Ingo Molnar, K Prateek Nayak
On 2/19/2026 11:48 PM, Peter Zijlstra wrote:
> On Tue, Feb 17, 2026 at 01:39:45PM +0530, K Prateek Nayak wrote:
>> I'm not sure if this is technically possible but assume following
>> topology:
>>
>> [ LLC: 8-15 ]
>> [ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
>>
>> and the following series of events:
>>
>> o All CPUs in LLC are offline to begin with (maxcpus = 1 like scenario).
>>
>> o CPUs 10-15 are onlined first.
>>
>> o CPU8 is put in a separate root partition and brought online.
>> (XXX: I'm not 100% sure if this is possible in this order)
>>
>> o build_sched_domains() will bail out at SMT domain since the cpumap
>> is covered by tl->mask() and tl_llc = tl_smt.
>>
>> o llc_id calculation uses the tl_smt->mask() which will not contain
>> CPUs 10-15 and CPU8 will get a unique LLC id even though there are
>> other online CPUs in the LLC with a different llc_id (!!!)
>
> Yeah, so partitions (including isol_cpus) could wreck things here, since
> this is purely about the sched_domains.
>
> You can create N single CPU partitions (isol_cpus does this) and end up
> with the same 'problem' that online one at a time loop did. Except this
> time it would not be 'wrong'. Since they are single CPU domains, you
> also don't get load-balancing, so who cares I suppose. But it will
> inflate max_lid.
>
> But suppose you create N/2 partitions (where N is the number of CPUs in
> the physical LLC), then you get many individual 'LLC's and
> load-balancing inside them. I suppose this is correct, although it does
> inflate max_lid somewhat beyond what you would normally expect.
>
> However, most of that space would be wasted, since you're not actually
> allowed to migrate to them.
>
Besides wasting space, after removing CPUs from all N/2 partitions and
merging them into the root partition, each CPU would still have a distinct
llc_id from the other CPUs in the same LLC domain, because we do not
reassign llc_id values to CPUs in the current version. This issue should be
resolved by switching to the new dynamic llc_id allocation and release method.
thanks,
Chenyu
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 15:40 ` Peter Zijlstra
@ 2026-02-20 15:53 ` Chen, Yu C
2026-02-20 16:03 ` Peter Zijlstra
0 siblings, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-20 15:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, K Prateek Nayak
Hi Peter,
On 2/19/2026 11:40 PM, Peter Zijlstra wrote:
> On Mon, Feb 16, 2026 at 01:14:20PM +0530, K Prateek Nayak wrote:
[ ... ]
>> +static int __sched_domains_alloc_llc_id(void)
>> +{
>> + int lid;
>> +
>> + lockdep_assert_held(&sched_domains_mutex);
>> +
>> + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
>> + if (lid >= tl_max_llcs)
>> + tl_max_llcs++;
>
> Urgh,. should we not rather track the max lid?
>
Do you mean we should not always increment the max lid,
but instead decrease it when an llc_id is released?
I think Tim has adjusted the code to shrink tl_max_llcs
when an llc_id is released:
https://lore.kernel.org/all/acc7a5c96e8235bf11af640798ce1b60bcaa8196.camel@linux.intel.com/
> Also, we allocate max_llc sized data structures, if this thing is
> 'variable' we must also always store a copy of the 'lid' size of the
> time of allocation.
>
Do you mean we should save the latest llc_max in the sched_domain
and publish it during sd attachment, as suggested at:
https://lore.kernel.org/all/20260220104533.GO1395266@noisy.programming.kicks-ass.net/
thanks,
Chenyu
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-20 15:53 ` Chen, Yu C
@ 2026-02-20 16:03 ` Peter Zijlstra
2026-02-20 16:10 ` Chen, Yu C
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 16:03 UTC (permalink / raw)
To: Chen, Yu C
Cc: Tim Chen, Ingo Molnar, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, K Prateek Nayak
On Fri, Feb 20, 2026 at 11:53:31PM +0800, Chen, Yu C wrote:
> Hi Peter,
>
> On 2/19/2026 11:40 PM, Peter Zijlstra wrote:
> > On Mon, Feb 16, 2026 at 01:14:20PM +0530, K Prateek Nayak wrote:
>
> [ ... ]
>
> > > +static int __sched_domains_alloc_llc_id(void)
> > > +{
> > > + int lid;
> > > +
> > > + lockdep_assert_held(&sched_domains_mutex);
> > > +
> > > + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
> > > + if (lid >= tl_max_llcs)
> > > + tl_max_llcs++;
> >
> > Urgh,. should we not rather track the max lid?
> >
>
> Do you mean we should not always increment the max lid,
> but instead decrease it when an llc_id is released?
> I think Tim has adjusted the code to shrink tl_max_llcs
> when an llc_id is released:
> https://lore.kernel.org/all/acc7a5c96e8235bf11af640798ce1b60bcaa8196.camel@linux.intel.com/
You can only shrink when the max lid is released. Since lid is an array
index, something like max_lid = weight(mask) would be terribly broken.
But what I was getting at is that the code as presented there is rather
non-obvious. Yes, if the lid is higher, it cannot be more than one
higher than the previous value, but something like:
lid = cpumask_first_zero();
BUG_ON(lid >= nr_cpu_ids);
max_lid = max(max_lid, lid);
Is way simpler to follow since it doesn't have that hidden assumption.
Then, if you want to allow shrinking, then the clear side could do
something like:
__cpumask_clear(lid, mask);
if (lid == max_lid)
max_lid = cpumask_last(mask);
or something like that.
> > Also, we allocate max_llc sized data structures, if this thing is
> > 'variable' we must also always store a copy of the 'lid' size of the
> > time of allocation.
> >
>
> Do you mean we should save the latest llc_max in the sched_domain
> and publish it during sd attachment, as suggested at:
>
> https://lore.kernel.org/all/20260220104533.GO1395266@noisy.programming.kicks-ass.net/
Yeah, having it separated like it is now feels super fragile.
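A small user-space sketch of the allocation scheme Peter outlines above (`alloc_lid()`, `free_lid()`, and the 64-id limit are illustrative assumptions, not the kernel code; a 64-bit word stands in for the cpumask):

```c
#include <assert.h>

#define NR_LIDS 64

static unsigned long long lid_mask;	/* bit i set => llc id i in use */
static int max_lid = -1;		/* highest live id, -1 when none */

static int alloc_lid(void)
{
	int lid;

	/* lowest clear bit, like cpumask_first_zero() */
	for (lid = 0; lid < NR_LIDS; lid++)
		if (!(lid_mask & (1ULL << lid)))
			break;
	assert(lid < NR_LIDS);		/* BUG_ON(lid >= nr_cpu_ids) analogue */
	lid_mask |= 1ULL << lid;
	if (lid > max_lid)		/* max_lid = max(max_lid, lid) */
		max_lid = lid;
	return lid;
}

static void free_lid(int lid)
{
	lid_mask &= ~(1ULL << lid);
	/* shrink only when the top id is released; find the new top
	 * set bit, like cpumask_last() */
	if (lid == max_lid)
		while (max_lid >= 0 && !(lid_mask & (1ULL << max_lid)))
			max_lid--;
}
```

Note that freeing a middle id leaves max_lid unchanged, which is exactly why `max_lid = weight(mask)` would be broken: the id is an array index, not a population count.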
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-20 16:03 ` Peter Zijlstra
@ 2026-02-20 16:10 ` Chen, Yu C
2026-02-20 19:24 ` Tim Chen
0 siblings, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-20 16:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, K Prateek Nayak
On 2/21/2026 12:03 AM, Peter Zijlstra wrote:
> On Fri, Feb 20, 2026 at 11:53:31PM +0800, Chen, Yu C wrote:
>> Hi Peter,
>>
>> On 2/19/2026 11:40 PM, Peter Zijlstra wrote:
>>> On Mon, Feb 16, 2026 at 01:14:20PM +0530, K Prateek Nayak wrote:
>>
>> [ ... ]
>>
>>>> +static int __sched_domains_alloc_llc_id(void)
>>>> +{
>>>> + int lid;
>>>> +
>>>> + lockdep_assert_held(&sched_domains_mutex);
>>>> +
>>>> + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
>>>> + if (lid >= tl_max_llcs)
>>>> + tl_max_llcs++;
>>>
>>> Urgh,. should we not rather track the max lid?
>>>
>>
>> Do you mean we should not always increment the max lid,
>> but instead decrease it when an llc_id is released?
>> I think Tim has adjusted the code to shrink tl_max_llcs
>> when an llc_id is released:
>> https://lore.kernel.org/all/acc7a5c96e8235bf11af640798ce1b60bcaa8196.camel@linux.intel.com/
>
> You can only shrink when the max lid is released. Since lid is an array
> index, something like max_lid = weight(mask) would be terribly broken.
>
> But what I was getting at is that the code as presented there is rather
> non-obvious. Yes, if the lid is higher, it cannot be more than one
> higher than the previous value, but something like:
>
> lid = cpumask_first_zero();
> BUG_ON(lid >= nr_cpu_ids);
> max_lid = max(max_lid, lid);
>
> Is way simpler to follow since it doesn't have that hidden assumption.
>
> Then, if you want to allow shrinking, then the clear side could do
> something like:
>
> __cpumask_clear(lid, mask);
> if (lid == max_lid)
> max_lid = cpumask_last(mask);
>
> or something like that.
>
Got it, we will adjust the code accordingly.
thanks,
Chenyu
* Re: [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter
2026-02-20 10:45 ` Peter Zijlstra
@ 2026-02-20 16:57 ` Chen, Yu C
2026-02-20 18:38 ` Peter Zijlstra
0 siblings, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-20 16:57 UTC (permalink / raw)
To: Peter Zijlstra, Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On 2/20/2026 6:45 PM, Peter Zijlstra wrote:
> On Tue, Feb 10, 2026 at 02:18:47PM -0800, Tim Chen wrote:
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index a4e2fb31f2fd..3aa6c101b2e4 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -102,6 +102,10 @@ struct sched_domain {
>> u64 max_newidle_lb_cost;
>> unsigned long last_decay_max_lb_cost;
>>
>> +#ifdef CONFIG_SCHED_CACHE
>> + unsigned int *pf;
>
> So I'm all for short names; but perhaps this could be better. When
> reading this my brain went page-fault, and then WTF :-)
>
OK, I assume you are suggesting renaming it to llc_counts.
>> +#endif
>> +
>> #ifdef CONFIG_SCHEDSTATS
>> /* sched_balance_rq() stats */
>> unsigned int lb_count[CPU_MAX_IDLE_TYPES];
>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index ca46b5cf7f78..dae78b5915a7 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>
>> @@ -2723,6 +2778,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>> }
>> rcu_read_unlock();
>>
>> + /*
>> + * Ensure we see enlarged sd->pf when we use new llc_ids and
>> + * bigger max_llcs.
>> + */
>> + smp_mb();
>> + max_llcs = tl_max_llcs;
>
> This seems wrong. This is *after* cpu_attach_domain() which publishes
> @sd. How about you do something like:
>
> struct sched_domain {
> ...
>
> unsigned int llc_max;
> unsigned int *llc_counts __counted_by(llc_max);
> }
>
> Then you always carry matching information that is published together.
>
OK, we will change it accordingly.
Additionally, with this change we should be able to safely read
the data in the sched_domain by verifying whether the target llc_id
falls within the valid range (to avoid a race condition):

CPU0                                  CPU1
----                                  ----
build_sched_domains()                 update_sg_lb_stats()
                                        for_each_cpu_and(i, sg)
                                          sd = rq[i]->sd
  per_cpu(sd_llc_id, i) = new_llc
                                          llc = llc_id(i)
                                          if (llc < sd->llc_max)
                                            safe read sd->pf[llc]
  alloc_sd_pref(cpu_map)
    sd->llc_counts = kzalloc()
    sd->llc_max = max_llc
Thanks,
Chenyu
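A hedged user-space sketch of the publish-size-with-buffer pattern suggested above (struct and function names are illustrative; in the kernel this is where `__counted_by(llc_max)` on the pointer would go, omitted here for portability):

```c
#include <assert.h>
#include <stdlib.h>

/* The buffer and its size live in one structure and are published
 * together, so a reader holding a pointer from an older topology can
 * still bounds-check safely -- the "if (llc < sd->llc_max)" step in
 * the race diagram. */
struct llc_stats {
	unsigned int llc_max;
	unsigned int *llc_counts;
};

static struct llc_stats *alloc_llc_stats(unsigned int llc_max)
{
	struct llc_stats *s = malloc(sizeof(*s));

	s->llc_counts = calloc(llc_max, sizeof(*s->llc_counts));
	s->llc_max = llc_max;	/* size is set before the struct is handed out */
	return s;
}

static unsigned int read_llc_count(const struct llc_stats *s, unsigned int llc)
{
	if (llc >= s->llc_max)	/* id from a newer topology: refuse to index */
		return 0;
	return s->llc_counts[llc];
}
```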
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-19 21:04 ` Tim Chen
@ 2026-02-20 17:17 ` Chen, Yu C
0 siblings, 0 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-20 17:17 UTC (permalink / raw)
To: Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Peter Zijlstra
On 2/20/2026 5:04 AM, Tim Chen wrote:
> On Thu, 2026-02-19 at 11:20 -0800, Tim Chen wrote:
>> On Thu, 2026-02-19 at 23:20 +0800, Chen, Yu C wrote:
>>> On 2/19/2026 10:59 PM, Peter Zijlstra wrote:
>>>> On Tue, Feb 10, 2026 at 02:18:44PM -0800, Tim Chen wrote:
[ ... ]
>>>>> +
>>>>> + lid = per_cpu(sd_llc_id, i);
>>>>> + if (lid == -1) {
>>>>> + int j;
>>>>> +
>>>>> + /*
>>>>> + * Assign the llc_id to the CPUs that do not
>>>>> + * have an LLC.
>>>>> + */
>>>>
>>>> Where does this happen? Is this for things like Atom that don't have an
>>>> L3 and so we don't set up a LLC domain?
>>>>
>>>
>>> Yes, for some hybrid platforms, some CPUs on that platforms might not
>>> have L3,
>>> Tim might correct me if I’m wrong. Above code is derived from the
>>> update_top_cache_domain(),
>>> if there is no sched domain with SD_SHARE_LLC, per_cpu(sd_llc_id, cpu)
>>> is set to the
>>> CPU number directly.
>>>
>>
>> That's correct. One example is Meteor Lake where some Atom CPUs don't have
>> L3 but have only L2. And some Ampere CPUs also have no shared L3.
>>
>> https://www.spinics.net/lists/kernel/msg5863118.html?utm_source=chatgpt.com
>>
>> This also reminded me that if we rely on cpu_coregroup_mask for LLC id
>> assignment, we may be missing out such platforms which need to treat
>> L2 as the last level cache. So we may need to fallback to cpu_clustergroup_mask
>> or cpu_smt_mask where applicable.
>
> On further inspection of the code, cpu_coregroup_mask will just be the same
> as cpu_clustergroup_mask for that case so we should be okay.
>
OK, I assume this is true for Intel platforms because the llc_id will
be set to l2_id if there is no L3 cache:
c->topo.llc_id = (l3_id == BAD_APICID) ? l2_id : l3_id;
I suppose AMD platforms should not be impacted because I have not seen
any non-L3 platforms (for AMD).
For non-x86 platforms, cpu_coregroup_mask() will be converted to the
cluster mask if no LLC is present.
thanks,
Chenyu
* Re: [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference
2026-02-20 14:02 ` Peter Zijlstra
@ 2026-02-20 17:25 ` Chen, Yu C
0 siblings, 0 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-20 17:25 UTC (permalink / raw)
To: Peter Zijlstra, Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On 2/20/2026 10:02 PM, Peter Zijlstra wrote:
> On Fri, Feb 20, 2026 at 12:02:22PM +0100, Peter Zijlstra wrote:
>> On Tue, Feb 10, 2026 at 02:18:48PM -0800, Tim Chen wrote:
>>> static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
>>> {
>>> + struct sched_domain *sd;
>>> int pref_llc;
>>>
>>> pref_llc = p->preferred_llc;
>>> - if (pref_llc < 0)
>>> + if (!valid_llc_id(pref_llc))
>>> return;
>>>
>>> rq->nr_llc_running++;
>>> rq->nr_pref_llc_running += (pref_llc == task_llc(p));
>>> +
>>> + scoped_guard (rcu) {
>>> + sd = rcu_dereference(rq->sd);
>>> + if (valid_llc_buf(sd, pref_llc))
>>> + sd->pf[pref_llc]++;
>>> + }
>>> }
>>>
>>> static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
>>> {
>>> + struct sched_domain *sd;
>>> int pref_llc;
>>>
>>> pref_llc = p->preferred_llc;
>>> - if (pref_llc < 0)
>>> + if (!valid_llc_id(pref_llc))
>>> return;
>>>
>>> rq->nr_llc_running--;
>>> rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
>>> +
>>> + scoped_guard (rcu) {
>>> + sd = rcu_dereference(rq->sd);
>>> + if (valid_llc_buf(sd, pref_llc)) {
>>> + /*
>>> + * There is a race condition between dequeue
>>> + * and CPU hotplug. After a task has been enqueued
>>> + * on CPUx, a CPU hotplug event occurs, and all online
>>> + * CPUs (including CPUx) rebuild their sched_domains
>>> + * and reset statistics to zero (including sd->pf).
>>> + * This can cause temporary undercount and we have to
>>> + * check for such underflow in sd->pf.
>>> + *
>>> + * This undercount is temporary and accurate accounting
>>> + * will resume once the rq has a chance to be idle.
>>> + */
>>> + if (sd->pf[pref_llc])
>>> + sd->pf[pref_llc]--;
>>> + }
>>> + }
>>> }
>>
>> FWIW, enqueue/dequeue must be with rq->lock held, and thus preemption
>> disabled and IRQs off. That RCU section is completely pointless.
>
> That is, use rcu_dereference_all() and observe the warning go away.
OK we will remove rcu_read_lock() and use rcu_dereference_all() directly.
thanks,
Chenyu
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-20 3:29 ` Qais Yousef
2026-02-20 9:43 ` Peter Zijlstra
@ 2026-02-20 18:14 ` Tim Chen
2026-02-24 3:02 ` Qais Yousef
1 sibling, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-20 18:14 UTC (permalink / raw)
To: Qais Yousef
Cc: Chen, Yu C, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
Josh Don, Gavin Guo, Libo Chen, linux-kernel
On Fri, 2026-02-20 at 03:29 +0000, Qais Yousef wrote:
> On 02/19/26 10:11, Tim Chen wrote:
> > On Thu, 2026-02-19 at 23:07 +0800, Chen, Yu C wrote:
> > > Hi Peter, Qais,
> > >
> > > On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> > > > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > > > On 02/10/26 14:18, Tim Chen wrote:
> > >
> > > [ ... ]
> > >
> > > > >
> > > > > I admit yet to look fully at the series. But I must ask, why are you deferring
> > > > > to load balance and not looking at wake up path? LB should be for corrections.
> > > > > When wake up path is doing wrong decision all the time, LB (which is super slow
> > > > > to react) is too late to start grouping tasks? What am I missing?
> > > >
> > > > There used to be wakeup steering, but I'm not sure that still exists in
> > > > this version (still need to read beyond the first few patches). It isn't
> > > > hard to add.
> > > >
> > >
> > > Please let me explain a little more about why we did this in the
> > > load balance path. Yes, the original version implemented cache-aware
> > > scheduling only in the wakeup path. According to our testing, this appeared
> > > to cause some task bouncing issues across LLCs. This was due to conflicts
> > > with the legacy load balancer, which tries to spread tasks to different
> > > LLCs. So as Peter said, the load balancer should be taken care of anyway.
> > > Later, we kept only the cache aware logic in the load balancer, and the
> > > test results became much more stable, so we kept it as is. The wakeup
> > > path more or less aggregates the wakees (threads within the same process)
> > > within the LLC in the wakeup fast path, so we have not changed it for now.
> > >
> > > Let me copy the changelog from the previous patch version:
> > >
> > > "
> > > In previous versions, aggregation of tasks were done in the
> > > wake up path, without making load balancing paths aware of
> > > LLC (Last-Level-Cache) preference. This led to the following
> > > problems:
> > >
> > > 1) Aggregation of tasks during wake up led to load imbalance
> > > between LLCs
> > > 2) Load balancing tried to even out the load between LLCs
> > > 3) Wake up tasks aggregation happened at a faster rate and
> > > load balancing moved tasks in opposite directions, leading
> > > to continuous and excessive task migrations and regressions
> > > in benchmarks like schbench.
> > >
> > > In this version, load balancing is made cache-aware. The main
> > > idea of cache-aware load balancing consists of two parts:
> > >
> > > 1) Identify tasks that prefer to run on their hottest LLC and
> > > move them there.
> > > 2) Prevent generic load balancing from moving a task out of
> > > its hottest LLC.
> > > "
> > >
> >
> > Another reason why we moved away from doing things in the wake up
> > path is load imbalance consideration. Wake up path does not have
> > the most up to date load information in the LLC sched domains as
> > in the load balance path. So you may actually have everyone rushed
>
> What's the reason wake up doesn't have the latest info? Is this a limitation of
> these large systems where stats updates are too expensive to do? Is it not
> fixable at all?
You will need to sum the load for each run queue for each LLC to get
an accurate picture. That will be too expensive on the wake up path.
Tim
>
> > into each's favorite LLC and causes LLC overload. And load balance
> > will have to undo this. This led to frequent task migrations that
> > hurts performance.
> >
> > It is better to consider LLC preference in the load balance path
> > so we can aggregate tasks while still keeping load imbalance under
> > control.
> >
> > Tim
* Re: [PATCH v3 19/21] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
2026-02-20 14:29 ` Peter Zijlstra
@ 2026-02-20 18:18 ` Tim Chen
0 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-20 18:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Fri, 2026-02-20 at 15:29 +0100, Peter Zijlstra wrote:
> On Tue, Feb 10, 2026 at 02:18:59PM -0800, Tim Chen wrote:
>
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index bae747eddc59..dc4b7de6569f 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -566,6 +566,16 @@ static __init int sched_init_debug(void)
> > #ifdef CONFIG_SCHED_CACHE
> > debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
> > &sched_cache_enable_fops);
> > + debugfs_create_u32("llc_aggr_tolerance", 0644, debugfs_sched,
> > + &llc_aggr_tolerance);
> > + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
> > + &llc_epoch_period);
> > + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
> > + &llc_epoch_affinity_timeout);
> > + debugfs_create_u32("llc_overaggr_pct", 0644, debugfs_sched,
> > + &llc_overaggr_pct);
> > + debugfs_create_u32("llc_imb_pct", 0644, debugfs_sched,
> > + &llc_imb_pct);
> > #endif
>
> So we have debug/sched/numa_balancing/, would it make sense to stick all
> of this into debug/sched/llc_balancing/ or so?
That's a good suggestion. Will do that in a follow-on version.
Tim
* Re: [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC
2026-02-20 13:53 ` Peter Zijlstra
@ 2026-02-20 18:22 ` Tim Chen
0 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-20 18:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, linux-kernel
On Fri, 2026-02-20 at 14:53 +0100, Peter Zijlstra wrote:
> On Tue, Feb 10, 2026 at 02:18:53PM -0800, Tim Chen wrote:
> > In generic (non-cache-aware) load balancing,
> > if the busiest runqueue has only one task, active balancing may be
> > invoked to move it. However, this migration might break LLC locality.
>
> I'm thinking this wants more explanation; yes it might break locality,
> but why is that bad?
That's to prevent regular load balance from migrating a task that
prefers the current LLC when the load level and imbalance don't warrant
breaking LLC preference per can_migrate_llc() policy.
Okay, will add more comments.
Tim
>
> This way you're inhibiting a full LLC from being able to power down.
>
> > +/*
> > + * Check if active load balance breaks LLC locality in
> > + * terms of cache aware load balance.
> > + */
> > +static inline bool
> > +alb_break_llc(struct lb_env *env)
> > +{
> > + if (!sched_cache_enabled())
> > + return false;
> > +
> > + if (cpus_share_cache(env->src_cpu, env->dst_cpu))
> > + return false;
> > + /*
> > + * All tasks prefer to stay on their current CPU.
> > + * Do not pull a task from its preferred CPU if:
> > + * 1. It is the only task running there; OR
> > + * 2. Migrating it away from its preferred LLC would violate
> > + * the cache-aware scheduling policy.
> > + */
> > + if (env->src_rq->nr_pref_llc_running &&
> > + env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
> > + unsigned long util = 0;
> > + struct task_struct *cur;
> > +
> > + if (env->src_rq->nr_running <= 1)
> > + return true;
> > +
> > + /*
> > + * Reach here in load balance with
> > + * rcu_read_lock() protected.
> > + */
>
> Not sure that comment helps much. rcu_dereference() itself will already
> cause complaints if the constraints are violated.
Okay, will remove this.
>
> > + cur = rcu_dereference(env->src_rq->curr);
> > + if (cur)
> > + util = task_util(cur);
> > +
> > + if (can_migrate_llc(env->src_cpu, env->dst_cpu,
> > + util, false) == mig_forbid)
> > + return true;
> > + }
> > +
> > + return false;
> > +}
>
>
* Re: [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter
2026-02-20 16:57 ` Chen, Yu C
@ 2026-02-20 18:38 ` Peter Zijlstra
0 siblings, 0 replies; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 18:38 UTC (permalink / raw)
To: Chen, Yu C
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On Sat, Feb 21, 2026 at 12:57:38AM +0800, Chen, Yu C wrote:
> llc=llc_id(i)
> if(llc<sd->llc_max)
> safe read sd->pf[llc]
Right, except llc_id() is allowed to return negative, so that would need
to be something like:
if ((unsigned)llc < sd->llc_max)
sd->llc_count[llc]
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-20 16:10 ` Chen, Yu C
@ 2026-02-20 19:24 ` Tim Chen
2026-02-20 19:30 ` Peter Zijlstra
0 siblings, 1 reply; 117+ messages in thread
From: Tim Chen @ 2026-02-20 19:24 UTC (permalink / raw)
To: Chen, Yu C, Peter Zijlstra
Cc: Ingo Molnar, Gautham R . Shenoy, Vincent Guittot, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
linux-kernel, K Prateek Nayak
On Sat, 2026-02-21 at 00:10 +0800, Chen, Yu C wrote:
> On 2/21/2026 12:03 AM, Peter Zijlstra wrote:
> > On Fri, Feb 20, 2026 at 11:53:31PM +0800, Chen, Yu C wrote:
> > > Hi Peter,
> > >
> > > On 2/19/2026 11:40 PM, Peter Zijlstra wrote:
> > > > On Mon, Feb 16, 2026 at 01:14:20PM +0530, K Prateek Nayak wrote:
> > >
> > > [ ... ]
> > >
> > > > > +static int __sched_domains_alloc_llc_id(void)
> > > > > +{
> > > > > + int lid;
> > > > > +
> > > > > + lockdep_assert_held(&sched_domains_mutex);
> > > > > +
> > > > > + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
> > > > > + if (lid >= tl_max_llcs)
> > > > > + tl_max_llcs++;
> > > >
> > > > Urgh,. should we not rather track the max lid?
> > > >
> > >
> > > Do you mean we should not always increment the max lid,
> > > but instead decrease it when an llc_id is released?
> > > I think Tim has adjusted the code to shrink tl_max_llcs
> > > when an llc_id is released:
> > > https://lore.kernel.org/all/acc7a5c96e8235bf11af640798ce1b60bcaa8196.camel@linux.intel.com/
> >
> > You can only shrink when the max lid is released. Since lid is an array
> > index, something like max_lid = weight(mask) would be terribly broken.
> >
> > But what I was getting at is that the code as presented there is rather
> > non-obvious. Yes, if the lid is higher, it cannot be more than one
> > higher than the previous value, but something like:
> >
> > lid = cpumask_first_zero();
> > BUG_ON(lid >= nr_cpu_ids);
> > max_lid = max(max_lid, lid);
> >
> > Is way simpler to follow since it doesn't have that hidden assumption.
> >
> > Then, if you want to allow shrinking, then the clear side could do
> > something like:
> >
> > __cpumask_clear(lid, mask);
> > if (lid == max_lid)
> > max_lid = cpumask_last(mask);
> >
> > or something like that.
> >
>
> Got it, we will adjust the code accordingly.
>
How about modifying this patch like the following:
Thanks.
Tim
---
diff --git a/init/Kconfig b/init/Kconfig
index f4b2649f8401..da405c00e9e3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -994,6 +994,7 @@ config SCHED_CACHE
bool "Cache aware load balance"
default y
depends on SMP
+ depends on SCHED_MC
help
When enabled, the scheduler will attempt to aggregate tasks from
the same process onto a single Last Level Cache (LLC) domain when
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c464e370576f..e34b5842caa4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8372,6 +8372,8 @@ int sched_cpu_deactivate(unsigned int cpu)
*/
synchronize_rcu();
+ sched_domains_free_llc_id(cpu);
+
sched_set_rq_offline(rq, cpu);
scx_rq_deactivate(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f4785f84b1f1..3096adc13074 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3932,6 +3932,13 @@ static inline bool sched_cache_enabled(void)
extern void sched_cache_active_set_unlocked(void);
#endif
+
+#ifdef CONFIG_SMP
+void sched_domains_free_llc_id(int cpu);
+#else /* !CONFIG_SMP: */
+static inline void sched_domains_free_llc_id(int cpu) { }
+#endif /* !CONFIG_SMP */
+
extern void init_sched_mm(struct task_struct *p);
extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e86dea1b9e86..f3bc6636170f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -18,6 +18,7 @@ void sched_domains_mutex_unlock(void)
}
/* Protected by sched_domains_mutex: */
+static cpumask_var_t sched_domains_llc_id_allocmask;
static cpumask_var_t sched_domains_tmpmask;
static cpumask_var_t sched_domains_tmpmask2;
static int tl_max_llcs;
@@ -2660,6 +2661,61 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
return true;
}
+#ifdef CONFIG_SMP
+static int __sched_domains_alloc_llc_id(void)
+{
+ int lid, max_lid;
+
+ lockdep_assert_held(&sched_domains_mutex);
+
+ lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
+ /*
+ * llc_id space should never grow larger than the
+ * possible number of CPUs in the system.
+ */
+ BUG_ON(lid >= nr_cpu_ids);
+ max_lid = cpumask_last(sched_domains_llc_id_allocmask);
+ /* size is one more than max index */
+ tl_max_llcs = max(lid, max_lid) + 1;
+
+ return lid;
+}
+
+static void __sched_domains_free_llc_id(int cpu)
+{
+ int i, lid, last_lid;
+
+ lockdep_assert_held(&sched_domains_mutex);
+
+ lid = per_cpu(sd_llc_id, cpu);
+ if (lid == -1)
+ return;
+
+ BUG_ON(lid >= nr_cpu_ids);
+ per_cpu(sd_llc_id, cpu) = -1;
+
+ for_each_cpu(i, cpu_coregroup_mask(cpu)) {
+ /* An online CPU owns the llc_id. */
+ if (per_cpu(sd_llc_id, i) == lid)
+ return;
+ }
+
+ cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
+
+ last_lid = cpumask_last(sched_domains_llc_id_allocmask);
+ /* shrink max LLC size to save memory */
+ if (last_lid < tl_max_llcs - 1)
+ tl_max_llcs = last_lid + 1;
+}
+
+void sched_domains_free_llc_id(int cpu)
+{
+ sched_domains_mutex_lock();
+ __sched_domains_free_llc_id(cpu);
+ sched_domains_mutex_unlock();
+}
+#endif /* CONFIG_SMP */
+
/*
* Build sched domains for a given set of CPUs and attach the sched domains
* to the individual CPUs
@@ -2685,18 +2741,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* Set up domains for CPUs specified by the cpu_map: */
for_each_cpu(i, cpu_map) {
- struct sched_domain_topology_level *tl, *tl_llc = NULL;
+ struct sched_domain_topology_level *tl;
int lid;
sd = NULL;
for_each_sd_topology(tl) {
- int flags = 0;
-
- if (tl->sd_flags)
- flags = (*tl->sd_flags)();
-
- if (flags & SD_SHARE_LLC)
- tl_llc = tl;
sd = build_sched_domain(tl, cpu_map, attr, sd, i);
@@ -2708,22 +2757,14 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
break;
}
+#ifdef CONFIG_SMP
lid = per_cpu(sd_llc_id, i);
if (lid == -1) {
int j;
- /*
- * Assign the llc_id to the CPUs that do not
- * have an LLC.
- */
- if (!tl_llc) {
- per_cpu(sd_llc_id, i) = tl_max_llcs++;
-
- continue;
- }
-
+ j = cpumask_first(cpu_coregroup_mask(i));
/* try to reuse the llc_id of its siblings */
- for_each_cpu(j, tl_llc->mask(tl_llc, i)) {
+ for (; j < nr_cpu_ids; j = cpumask_next(j, cpu_coregroup_mask(i))) {
if (i == j)
continue;
@@ -2738,8 +2779,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
/* a new LLC is detected */
if (lid == -1)
- per_cpu(sd_llc_id, i) = tl_max_llcs++;
+ per_cpu(sd_llc_id, i) = __sched_domains_alloc_llc_id();
}
+#endif /* CONFIG_SMP */
}
if (WARN_ON(!topology_span_sane(cpu_map)))
@@ -2939,6 +2981,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
{
int err;
+ zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-20 19:24 ` Tim Chen
@ 2026-02-20 19:30 ` Peter Zijlstra
2026-02-20 19:35 ` Tim Chen
0 siblings, 1 reply; 117+ messages in thread
From: Peter Zijlstra @ 2026-02-20 19:30 UTC (permalink / raw)
To: Tim Chen
Cc: Chen, Yu C, Ingo Molnar, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, K Prateek Nayak
On Fri, Feb 20, 2026 at 11:24:11AM -0800, Tim Chen wrote:
> +static int __sched_domains_alloc_llc_id(void)
> +{
> + int lid, max_lid;
> +
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
> + /*
> + * llc_id space should never grow larger than the
> + * possible number of CPUs in the system.
> + */
> + BUG_ON(lid >= nr_cpu_ids);
__cpumask_set_cpu(lid, sched_domains_llc_is_allocmask);
> + max_lid = cpumask_last(sched_domains_llc_id_allocmask);
> + /* size is one more than max index */
> + tl_max_llcs = max(lid, max_lid) + 1;
> +
> + return lid;
> +}
> +
> +static void __sched_domains_free_llc_id(int cpu)
> +{
> + int i, lid, last_lid;
> +
> + lockdep_assert_held(&sched_domains_mutex);
> +
> + lid = per_cpu(sd_llc_id, cpu);
> + if (lid == -1)
> + return;
> +
> + BUG_ON(lid >= nr_cpu_ids);
> + per_cpu(sd_llc_id, cpu) = -1;
> +
> + for_each_cpu(i, cpu_coregroup_mask(cpu)) {
> + /* An online CPU owns the llc_id. */
> + if (per_cpu(sd_llc_id, i) == lid)
> + return;
> + }
> +
__cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
> +
> + last_lid = cpumask_last(sched_domains_llc_id_allocmask);
> + /* shrink max LLC size to save memory */
> + if (last_lid < tl_max_llcs - 1)
> + tl_max_llcs = last_lid + 1;
> +}
Might be simpler to just track max_lid, and do the +1 at the alloc site?
* Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
2026-02-20 19:30 ` Peter Zijlstra
@ 2026-02-20 19:35 ` Tim Chen
0 siblings, 0 replies; 117+ messages in thread
From: Tim Chen @ 2026-02-20 19:35 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Chen, Yu C, Ingo Molnar, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, K Prateek Nayak
On Fri, 2026-02-20 at 20:30 +0100, Peter Zijlstra wrote:
> On Fri, Feb 20, 2026 at 11:24:11AM -0800, Tim Chen wrote:
>
> > +static int __sched_domains_alloc_llc_id(void)
> > +{
> > + int lid, max_lid;
> > +
> > + lockdep_assert_held(&sched_domains_mutex);
> > +
> > + lid = cpumask_first_zero(sched_domains_llc_id_allocmask);
> > + /*
> > + * llc_id space should never grow larger than the
> > + * possible number of CPUs in the system.
> > + */
> > + BUG_ON(lid >= nr_cpu_ids);
>
> __cpumask_set_cpu(lid, sched_domains_llc_is_allocmask);
Ah yes, fat fingers delete one line too many.
>
> > + max_lid = cpumask_last(sched_domains_llc_id_allocmask);
> > + /* size is one more than max index */
> > + tl_max_llcs = max(lid, max_lid) + 1;
> > +
> > + return lid;
> > +}
> > +
> > +static void __sched_domains_free_llc_id(int cpu)
> > +{
> > + int i, lid, last_lid;
> > +
> > + lockdep_assert_held(&sched_domains_mutex);
> > +
> > + lid = per_cpu(sd_llc_id, cpu);
> > + if (lid == -1)
> > + return;
> > +
> > + BUG_ON(lid >= nr_cpu_ids);
> > + per_cpu(sd_llc_id, cpu) = -1;
> > +
> > + for_each_cpu(i, cpu_coregroup_mask(cpu)) {
> > + /* An online CPU owns the llc_id. */
> > + if (per_cpu(sd_llc_id, i) == lid)
> > + return;
> > + }
> > +
> __cpumask_clear_cpu(lid, sched_domains_llc_id_allocmask);
> > +
> > + last_lid = cpumask_last(sched_domains_llc_id_allocmask);
> > + /* shrink max LLC size to save memory */
> > + if (last_lid < tl_max_llcs - 1)
> > + tl_max_llcs = last_lid + 1;
> > +}
>
> Might be simpler to just track max_lid, and do the +1 at the alloc site?
>
Sure, will do. Will also update the code to validate the lid value accordingly.
Tim
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-20 3:25 ` Qais Yousef
@ 2026-02-21 2:48 ` Chen, Yu C
2026-02-24 3:11 ` Qais Yousef
0 siblings, 1 reply; 117+ messages in thread
From: Chen, Yu C @ 2026-02-21 2:48 UTC (permalink / raw)
To: Qais Yousef
Cc: Peter Zijlstra, Tim Chen, Ingo Molnar, K Prateek Nayak,
Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
Josh Don, Gavin Guo, Libo Chen, linux-kernel
On 2/20/2026 11:25 AM, Qais Yousef wrote:
> On 02/19/26 23:07, Chen, Yu C wrote:
>> Hi Peter, Qais,
>>
>> On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
>>> On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
>>>> On 02/10/26 14:18, Tim Chen wrote:
[ ... ]
>> became much more stable, so we kept it as is. The wakeup path more or less
aggregates the wakees (threads within the same process) within the LLC in the
>> wakeup fast path, so we have not changed it for now.
>
> How expensive is it to use the new push lb, which unifies the decision with
> wake up path, to detect these bad task placement and steer them back to the
> right LLC? I think if we can construct the trigger right, we can simplify the
> load balance to keep tagged tasks within the same LLC much easier. In my view
> this bad task placement is just a new type of misfit where a task has strayed
> from its group for whatever reason at wake up and it is not sleeping and waking
> up again to be placed back with its clan - assuming the conditions has changed
> to warrant the move - which the wake up path should handle anyway.
>
> FWIW, I have been experimenting to use push lb to keep regular LB off and rely
> solely on it to manage the important corner cases (including overloaded one)
> - and seeing *very* promising results. But the systems I work with are small
> compared to yours.
>
> But essentially if we can construct the system to keep the wakeup path (via
> regular sleep/wakeup cycle and push lb) maintain the system relatively balanced
> and delay regular LB for when we need to do large intervention, we can simplify
> the problem space significantly IMHO. If the LB had to kick in, then the delays
> of not finding enough bandwidth to run are larger than the delays of not
> sharing the hottest LLC. IOW, keep the regular LB as-is for true load balance
> and handle the small exceptions via natural sleep/wakeup cycle or push lb.
>
Leveraging push-lb for cache-aware task placement is interesting,
and we considered it during LPC when Vincent and Prateek presented it.
It could be an enhancement to the basic cache-aware scheduling, IMO.
Tim has mentioned in
https://lore.kernel.org/all/4514b6aef56d0ae144ebd56df9211c6599744633.camel@linux.intel.com/
that a bouncing issue needs to be resolved if task wakeup and push-lb
are leveraged for cache-aware scheduling. Both are very fast, so for
cache-aware scheduling it is possible that multiple invocations of
select_idle_sibling() will find the same LLC suitable. Then multiple
wakees are woken up on that LLC, causing over-aggregation. Later, when
over-aggregation is detected, several tasks are migrated out of the LLC,
which makes the LLC eligible again, and the pattern repeats back and
forth.
>>
>> Let me copy the changelog from the previous patch version:
>>
>> "
>> In previous versions, aggregation of tasks were done in the
>> wake up path, without making load balancing paths aware of
>> LLC (Last-Level-Cache) preference. This led to the following
>> problems:
>>
>> 1) Aggregation of tasks during wake up led to load imbalance
>> between LLCs
>> 2) Load balancing tried to even out the load between LLCs
>> 3) Wake up tasks aggregation happened at a faster rate and
>> load balancing moved tasks in opposite directions, leading
>> to continuous and excessive task migrations and regressions
>> in benchmarks like schbench.
>
> Note this is an artefact of tagging all tasks belonging to the process as
> co-dependent. So somehow this is a case of shooting one self in the foot
> because processes with large number of tasks will create large imbalances and
> will start to require special handling. I guess the question is, were they really
> that packed which means the steering logic needed to relax a little bit and say
> hey, this is an overcommit I must spill to the other LLCs, or was it really
> okay to pack them all in one LLC and LB was overzealous to kick in and needed
> to be aware the new case is not really a problem that requires its
> intervention?
>
>>
>> In this version, load balancing is made cache-aware. The main
>> idea of cache-aware load balancing consists of two parts:
>
> I think this might work under the conditions you care about. But will be hard
> to generalize. But I might need to go and read more.
>
> Note I am mainly concerned because the wake up path can't stay based purely on
> load forever and need to be able to do smarter decisions (latency being the
> most important one in the horizon). And they all will hit this problem. I think
> we need to find a good recipe for how to handle these problems in general.
> I don't think we can extend the LB to be energy aware, latency aware, cache
> aware etc without hitting a lot of hurdles. And it is too slow to react.
>
>>
>> 1) Identify tasks that prefer to run on their hottest LLC and
>> move them there.
>> 2) Prevent generic load balancing from moving a task out of
>> its hottest LLC.
>
> Isn't this 2nd part the fix to the wake up problem you faced? 1 should
> naturally be happening at wake up. And for random long running strayed tasks,
> I believe push lb is an easier way to manage them.
This is doable, though some logic needs to be added in the wakeup/push-lb
paths to avoid the bouncing issue mentioned above. Considering both where
the aggregation is done (task wakeup, push lb, or generic lb) and how
tasks are tagged, I was thinking that threads within one process are a
special case of tagging: if the user chooses to create threads rather
than fork new processes, there is a higher potential for data sharing
among those threads. That said, we agree that fine-grained tagging is
necessary. How about this: if the user explicitly tags tasks into a
single group, the kernel can perform aggressive task aggregation - for
instance, in the wakeup/fair-push path - and the user accepts the
corresponding risks. For the default model, generic load balancing can
perform per-process task aggregation at a slower pace to reduce the risk
of false decisions and over-aggregation. We intended to discuss this in
a separate thread, though.
Thanks,
Chenyu
* Re: [PATCH v3 09/21] sched/cache: Count tasks prefering destination LLC in a sched group
2026-02-20 13:43 ` Peter Zijlstra
@ 2026-02-21 2:53 ` Chen, Yu C
0 siblings, 0 replies; 117+ messages in thread
From: Chen, Yu C @ 2026-02-21 2:53 UTC (permalink / raw)
To: Peter Zijlstra, Tim Chen
Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel
On 2/20/2026 9:43 PM, Peter Zijlstra wrote:
> On Fri, Feb 20, 2026 at 01:52:48PM +0100, Peter Zijlstra wrote:
>> On Tue, Feb 10, 2026 at 02:18:49PM -0800, Tim Chen wrote:
>>
>>> @@ -11034,6 +11037,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>> {
>>> int i, nr_running, local_group, sd_flags = env->sd->flags;
>>> bool balancing_at_rd = !env->sd->parent;
>>> +#ifdef CONFIG_SCHED_CACHE
>>> + int dst_llc = llc_id(env->dst_cpu);
>>> +#endif
>>>
>>> memset(sgs, 0, sizeof(*sgs));
>>>
>>> @@ -11054,6 +11060,15 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>> if (cpu_overutilized(i))
>>> *sg_overutilized = 1;
>>>
>>> +#ifdef CONFIG_SCHED_CACHE
>>> + if (sched_cache_enabled() && llc_id(i) != dst_llc) {
>>
>> If you write that like:
>>
>> if (sched_cache_enabled && llc_id(i) != llc_id(env->dst_cpu))
>>
>> You can get rid of that dst_llc variable, but more importantly its
>> ifdeffery.
>
> Ah, you're perhaps wanting to not re-load on the dst_llc usage below?
Yes.
> Do the compilers DTRT when you mark llc_id() as __pure?
OK, we will give this a try.
thanks,
Chenyu
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-20 9:43 ` Peter Zijlstra
@ 2026-02-24 2:49 ` Qais Yousef
0 siblings, 0 replies; 117+ messages in thread
From: Qais Yousef @ 2026-02-24 2:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Chen, Yu C, Ingo Molnar, K Prateek Nayak,
Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
Josh Don, Gavin Guo, Libo Chen, linux-kernel
On 02/20/26 10:43, Peter Zijlstra wrote:
> On Fri, Feb 20, 2026 at 03:29:41AM +0000, Qais Yousef wrote:
>
> > What's the reason wake up doesn't have the latest info? Is this a limitation of
> > these large systems where stats updates are too expensive to do? Is it not
> > fixable at all?
>
> Scalability is indeed the main problem. The periodic load-balancer, by
> virtue of being 'slow' has two advantages:
>
> - the cost of aggregating the numbers is amortized by the relative low
> frequency of aggregation
>
> - it can work with averages; it is less concerned with immediate
> spikes.
>
> This obviously has the exact inverse set of problems in that it is not
> able to deal with immediate/short term issues.
Yes. And if we are to focus on providing better task placement based on QoS
(which is what I think this essentially is), we have a constant problem of two
paths producing incompatible results. Which is why I am trying to
stress the importance of the wake up path. I understand for this initial drop
we don't have a way to provide specific hints for tasks, but this is why we
always end up with these difficult choices - which I think we don't have to.
More on this at the bottom.
>
>
> Anyway, we're already at the point where EAS wakeup path is getting far
> too expensive for the current set of hardware. While we started with a
> handful of asymmetric CPUs, we're now pushing 32 CPUs or so.
Is this 32 perf domains? Expensive for what workloads? Folks can still use
performance governor and plug it to a wall if they want ;-)
>
> (Look at Intel Nova Lake speculation online, that's supposedly going to
> get us 2 dies of 8P+16E with another 4 bonus weaklings on the south
> bridge or something, for a grand total of 52 asymmetric CPUs of 3 kinds)
Not sure if my experience matters for whatever this is supposed to be used for,
but the cost of a wrong decision is really high on these topologies. It is bloody
worthwhile spending more time to select a better CPU and worthwhile to have the
push lb do frequent corrections. Not sure if you saw the other thread on one of
Vincent's patches - but I am trying to completely disable overutilized (or
regular LB) and rely on wakeup + push lb and seeing great success (and gains).
But I am carrying a number of improvements that I discussed in various places
on the list that makes this effective setup. Hopefully I'll share full findings
properly at some point.
>
>
> Then consider:
>
> - Intel Granite Rapids-SP at 8*86 cores for 688 cores / 1376 threads.
>
> - AMD Prometheus at 2*192 cores with 384 cores / 768 threads. These
are silly numbers of CPUs.
>
> - Power10, it is something like 16 sockets, 16 cores per socket, 8
> threads per core for a mere 2048 threads.
>
> Now, these are the extreme end of the spectrum systems, 'nobody' will
> actually have them, but in a few generations they'll seem small again.
>
>
> So whatever we build now, will have to deal with silly numbers of CPUs.
True, but I think we ought to bite the bullet at some point. My line of
thought is that we don't have to (and actually shouldn't) make the compromise
at the kernel level. We can define the problem such that it is opt-in/opt-out
where users who find the benefit can opt-in or find a disadvantage opt-out.
Now the difficulty is that we don't have a way to describe such things, and
this is what I am trying to solve with Sched QoS library. I am writing this
now, but I think I should be able to help with this use case so that users can
describe which workload wants to benefit from co-locating and these tasks will
take the hit of harder task placement and frequent migration under loaded
scenarios - the contract being that co-location has a significant
performance impact, so they are happy to pay the price. Tasks that didn't
subscribe will work as-is.
Anyway, my major goal is to find how we can tie all these stories together as
we need to add ability to do task placement based on special requirements and
the conflict with LB is one major one that I think Vincent's proposal for push
lb is quite neat and spot on. I am not sure if you saw our LPC talk about Sched
QoS where we expanded on our overall thoughts.
In my view, this problem belongs to the same class of problems of placement
based on special requirements (latency, energy, cache etc) and hopefully we can
address along the way. But if not, it would be good to know more so we can
think how we can better incorporate it as part of the bigger story.
So far I think if this can be made to go through the wake up path and rely on
push lb; it is part of the same story. If not, then we need to think harder how
to connect things together for a coherent approach.
If I can successfully give you a way to describe the requirement of tasks needs
to be co-located so that we don't have to make the assumption in the kernel
that tasks belonging to the same process needs to stay in the same LLC, do you
think wake up + push lb works? If not, how do you see it evolving? And more
importantly, how do you view the role of regular LB in these cases? The way
I see it is that it should trigger less for the reasons you mentioned at the
top; and when it triggers it means heavy intervention is required and whatever
special task placement requirements will need to be dropped at this stage since
the push lb clearly failed to keep up and we are at a point where we need to do
heavy handed balancing work. I think these activities are more relevant to
multi-LLC systems - which have the added problem of defining when some
imbalances are okay; which I believe the difficulty being hit here with wakeup
path based approach. For single LLC systems I think this heavy handed approach
can be made unnecessary if we do it correctly.
Sorry, a bit of a divergence. But I am interested in how we can all move the
ship in the same direction. I think this is all part of making the wake up path
multi-modal and improving its coordination with LB.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-20 18:14 ` Tim Chen
@ 2026-02-24 3:02 ` Qais Yousef
0 siblings, 0 replies; 117+ messages in thread
From: Qais Yousef @ 2026-02-24 3:02 UTC (permalink / raw)
To: Tim Chen
Cc: Chen, Yu C, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
Josh Don, Gavin Guo, Libo Chen, linux-kernel
On 02/20/26 10:14, Tim Chen wrote:
> > > Another reason why we moved away from doing things in the wake up
> > > path is load imbalance consideration. Wake up path does not have
> > > the most up to date load information in the LLC sched domains as
> > > in the load balance path. So you may actually have everyone rushed
> >
> > What's the reason wake up doesn't have the latest info? Is this a limitation of
> > these large systems where stats updates are too expensive to do? Is it not
> > fixable at all?
>
> You will need to sum the load for each run queue for each LLC to get
> an accurate picture. That will be too expensive on the wake up path.
I am probably missing something obvious, but it seems enqueue/dequeue + TICK
are not keeping stats up-to-date enough for the wakeup path to rely on. I need
to read this code more.
I could be wrong, but as I was trying to highlight in other places, I think the
fact that we tag all tasks belonging to a process as needing to stay together
is exaggerating this problem. First, every process is assumed to need to stay
within the same LLC, and so is every task within the process. The wake up path
by design now has a more difficult job and needs to look harder than if the
tagging were more conservative. And I can appreciate that defining, and
teaching regular LB, that some imbalances are okay under these situations is
hard. It is sort of an overcommit situation by design.
Anyway. As I was trying to tell Peter, I am trying to think about how we can
tie all these similar stories together. I hope that once we can provide a
sensible way to tag tasks, we can get the wake up path + push lb to work
easily, as then we should have a handful of tasks asking to co-locate, which is
much easier to manage.
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-21 2:48 ` Chen, Yu C
@ 2026-02-24 3:11 ` Qais Yousef
0 siblings, 0 replies; 117+ messages in thread
From: Qais Yousef @ 2026-02-24 3:11 UTC (permalink / raw)
To: Chen, Yu C
Cc: Peter Zijlstra, Tim Chen, Ingo Molnar, K Prateek Nayak,
Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
Yangyu Chen, Tingyin Duan, Vern Hao, Vern Hao, Len Brown,
Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
Josh Don, Gavin Guo, Libo Chen, linux-kernel
On 02/21/26 10:48, Chen, Yu C wrote:
> Leveraging push-lb for cache-aware task placement is interesting,
> and we have considered it during LPC when Vincent and Prateek presented it.
> It could be an enhancement to the basic cache-aware scheduling, IMO.
> Tim has mentioned that in
> https://lore.kernel.org/all/4514b6aef56d0ae144ebd56df9211c6599744633.camel@linux.intel.com/
> a bouncing issue needs to be resolved if task wakeup and push-lb are
> leveraged for cache-aware scheduling. They are very fast - so for
> cache-aware scheduling, it is possible that multiple invocations of
> select_idle_sibling() will find the same LLC suitable. Then multiple
> wakees are woken up on that LLC, causing over-aggregation. Later, when
> over-aggregation is detected, several tasks are migrated out of the LLC,
> which makes the LLC eligible again - and the pattern repeats back and forth.
I believe this is a symptom of how tagging is currently happening. I think that
with a more conservative tagging approach this will be less of a problem. But
the proof is in the pudding, as they say :-)
* Re: [PATCH v3 00/21] Cache Aware Scheduling
2026-02-20 8:45 ` Peter Zijlstra
@ 2026-02-24 3:31 ` Qais Yousef
0 siblings, 0 replies; 117+ messages in thread
From: Qais Yousef @ 2026-02-24 3:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Libo Chen, linux-kernel
On 02/20/26 09:45, Peter Zijlstra wrote:
> On Fri, Feb 20, 2026 at 03:41:27AM +0000, Qais Yousef wrote:
>
> > IMHO I see people are constantly tripping over task placement being too simple
> > and need smarter decision making process.
>
> So at the same time we're always having trouble because its too
> expensive for some.
If they don't want it, they can turn it off with a simple debugfs/sched_feat
toggle? I think our way out of this dilemma is to make it their choice. You
know, many problems can disappear if you make them another person's problem :-)
Joking aside, I am trying to implement scheduler profiles in Sched QoS so
that users can pick throughput, interactive, etc. and toggle a few debugfs
knobs on their behalf. Hopefully this will help abstract the problem while
still keeping our kernel development mostly as-is. I don't think we are forced
into a choice in many cases (at the kernel level). But what do I know :-)
* Re: [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts
2026-02-20 9:53 ` Peter Zijlstra
@ 2026-02-24 9:42 ` Madadi Vineeth Reddy
0 siblings, 0 replies; 117+ messages in thread
From: Madadi Vineeth Reddy @ 2026-02-24 9:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo,
Qais Yousef, Libo Chen, linux-kernel, Madadi Vineeth Reddy
Hi Peter,
Sorry for the delayed response. Wanted to be sure before responding.
On 20/02/26 15:23, Peter Zijlstra wrote:
> On Fri, Feb 20, 2026 at 12:10:21PM +0530, Madadi Vineeth Reddy wrote:
>> Hi Peter,
>>
>> On 19/02/26 22:25, Peter Zijlstra wrote:
>>> On Wed, Feb 18, 2026 at 11:24:05PM +0530, Madadi Vineeth Reddy wrote:
>>>> Is there a way to make this useful for architectures with small LLC
>>>> sizes? One possible approach we were exploring is to have LLC at a
>>>> hemisphere level that comprise multiple SMT4 cores.
>>>
>>> Is this hemisphere an actual physical cache level, or would that be
>>> artificial?
>>
>> It's artificial. There is no cache being shared at this level but this is
>> still the level where some amount of cache-snooping takes place and it is
>> relatively faster to access the data from the caches of the cores
>> within this domain.
>>
>> We verified with this producer consumer workload where the producer
>> and consumer threads placed in the same hemisphere showed measurably
>> better latency compared to cross-hemisphere placement.
>
> So I just read the Power10 Wikipedia entry; that seems to suggest there
> actually is a significant L3 at the hemisphere level.
>
> That thing states that Power10 has:
>
> - 16 cores in two hemispheres of 8 cores each.
> - each core has 2M L2 cache
> - each hemi has 64M of L3 cache
The Wikipedia entry is incorrect. On Power10, L3 is at the SMT4
small core level (4M per core), not at the hemisphere level. This
is documented in the Power10 user manual [1] (Page 175). L3 is
also a victim cache on Power10.
>
> Then there appears to be a 'funny' in that there's always one 'dead'
> core, so you end up with 8+7, and the small hemi loses an 8M L3 slice
> due to that.
>
> Now, I'm just reading a Wiki page written by a random person on the
> interweb, so perhaps this is wrong (in which case I would suggest you
> get someone from IBM to go and edit that page and provide references),
> or there has been a miscommunication somewhere else, and perhaps there
> really is L3 at the hemi level, and arch/powerpc/ 'forgot' to expose
> that :-)
Yes, the Wikipedia page is wrong on this. We will get it corrected
with proper references.
[1] https://files.openpower.foundation/s/EgCy7C43p2NSRfR
Thanks,
Vineeth
Thread overview: 117+ messages (newest: 2026-02-24 9:43 UTC)
2026-02-10 22:18 [PATCH v3 00/21] Cache Aware Scheduling Tim Chen
2026-02-10 22:18 ` [PATCH v3 01/21] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
2026-02-14 12:26 ` Madadi Vineeth Reddy
2026-02-14 15:34 ` Chen, Yu C
2026-02-17 18:51 ` Tim Chen
2026-02-10 22:18 ` [PATCH v3 02/21] sched/cache: Record per LLC utilization to guide cache aware scheduling decisions Tim Chen
2026-02-10 22:18 ` [PATCH v3 03/21] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
2026-02-14 16:12 ` Madadi Vineeth Reddy
2026-02-15 12:14 ` Chen, Yu C
2026-02-19 11:29 ` Peter Zijlstra
2026-02-19 14:48 ` Chen, Yu C
2026-02-19 14:55 ` Peter Zijlstra
2026-02-10 22:18 ` [PATCH v3 04/21] sched/cache: Make LLC id continuous Tim Chen
2026-02-14 17:53 ` Madadi Vineeth Reddy
2026-02-15 14:25 ` Chen, Yu C
2026-02-17 10:05 ` Madadi Vineeth Reddy
2026-02-17 21:20 ` Tim Chen
2026-02-16 7:44 ` K Prateek Nayak
2026-02-17 6:07 ` Chen, Yu C
2026-02-17 8:09 ` K Prateek Nayak
2026-02-17 23:12 ` Tim Chen
2026-02-18 3:28 ` K Prateek Nayak
2026-02-18 15:22 ` Chen, Yu C
2026-02-18 17:46 ` K Prateek Nayak
2026-02-18 23:21 ` Tim Chen
2026-02-19 6:12 ` K Prateek Nayak
2026-02-19 15:51 ` Peter Zijlstra
2026-02-20 0:11 ` Tim Chen
2026-02-19 11:25 ` Chen, Yu C
2026-02-19 16:10 ` K Prateek Nayak
2026-02-18 18:45 ` Tim Chen
2026-02-18 21:33 ` Tim Chen
2026-02-18 15:11 ` Chen, Yu C
2026-02-19 15:48 ` Peter Zijlstra
2026-02-20 15:22 ` Chen, Yu C
2026-02-19 15:40 ` Peter Zijlstra
2026-02-20 15:53 ` Chen, Yu C
2026-02-20 16:03 ` Peter Zijlstra
2026-02-20 16:10 ` Chen, Yu C
2026-02-20 19:24 ` Tim Chen
2026-02-20 19:30 ` Peter Zijlstra
2026-02-20 19:35 ` Tim Chen
2026-02-19 11:35 ` Peter Zijlstra
2026-02-19 18:17 ` Tim Chen
2026-02-19 14:59 ` Peter Zijlstra
2026-02-19 15:20 ` Chen, Yu C
2026-02-19 19:20 ` Tim Chen
2026-02-19 21:04 ` Tim Chen
2026-02-20 17:17 ` Chen, Yu C
2026-02-10 22:18 ` [PATCH v3 05/21] sched/cache: Assign preferred LLC ID to processes Tim Chen
2026-02-14 18:36 ` Madadi Vineeth Reddy
2026-02-16 6:58 ` Chen, Yu C
2026-02-10 22:18 ` [PATCH v3 06/21] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
2026-02-10 22:18 ` [PATCH v3 07/21] sched/cache: Introduce per CPU's tasks LLC preference counter Tim Chen
2026-02-20 10:45 ` Peter Zijlstra
2026-02-20 16:57 ` Chen, Yu C
2026-02-20 18:38 ` Peter Zijlstra
2026-02-10 22:18 ` [PATCH v3 08/21] sched/cache: Calculate the percpu sd task LLC preference Tim Chen
2026-02-20 11:02 ` Peter Zijlstra
2026-02-20 14:02 ` Peter Zijlstra
2026-02-20 17:25 ` Chen, Yu C
2026-02-10 22:18 ` [PATCH v3 09/21] sched/cache: Count tasks prefering destination LLC in a sched group Tim Chen
2026-02-20 12:52 ` Peter Zijlstra
2026-02-20 13:43 ` Peter Zijlstra
2026-02-21 2:53 ` Chen, Yu C
2026-02-10 22:18 ` [PATCH v3 10/21] sched/cache: Check local_group only once in update_sg_lb_stats() Tim Chen
2026-02-10 22:18 ` [PATCH v3 11/21] sched/cache: Prioritize tasks preferring destination LLC during balancing Tim Chen
2026-02-17 18:33 ` Madadi Vineeth Reddy
2026-02-17 21:45 ` Tim Chen
2026-02-10 22:18 ` [PATCH v3 12/21] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
2026-02-10 22:18 ` [PATCH v3 13/21] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
2026-02-17 19:00 ` Madadi Vineeth Reddy
2026-02-17 22:04 ` Tim Chen
2026-02-20 13:53 ` Peter Zijlstra
2026-02-20 18:22 ` Tim Chen
2026-02-10 22:18 ` [PATCH v3 14/21] sched/cache: Respect LLC preference in task migration and detach Tim Chen
2026-02-18 9:14 ` Madadi Vineeth Reddy
2026-02-18 15:34 ` Chen, Yu C
2026-02-10 22:18 ` [PATCH v3 15/21] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
2026-02-18 17:54 ` Madadi Vineeth Reddy
2026-02-18 21:44 ` Tim Chen
2026-02-19 2:28 ` Madadi Vineeth Reddy
2026-02-19 14:38 ` Chen, Yu C
2026-02-19 21:12 ` Tim Chen
2026-02-19 16:52 ` Peter Zijlstra
2026-02-20 7:02 ` Madadi Vineeth Reddy
2026-02-19 16:55 ` Peter Zijlstra
2026-02-20 6:40 ` Madadi Vineeth Reddy
2026-02-20 9:53 ` Peter Zijlstra
2026-02-24 9:42 ` Madadi Vineeth Reddy
2026-02-19 16:50 ` Peter Zijlstra
2026-02-19 21:06 ` Tim Chen
2026-02-10 22:18 ` [PATCH v3 16/21] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
2026-02-10 22:18 ` [PATCH v3 17/21] sched/cache: Enable cache aware scheduling for multi LLCs NUMA node Tim Chen
2026-02-10 22:18 ` [PATCH v3 18/21] sched/cache: Allow the user space to turn on and off cache aware scheduling Tim Chen
2026-02-10 22:18 ` [PATCH v3 19/21] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling Tim Chen
2026-02-20 14:29 ` Peter Zijlstra
2026-02-20 18:18 ` Tim Chen
2026-02-10 22:19 ` [PATCH v3 20/21] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Tim Chen
2026-02-10 22:19 ` [PATCH v3 21/21] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics Tim Chen
2026-02-19 14:08 ` [PATCH v3 00/21] Cache Aware Scheduling Qais Yousef
2026-02-19 14:41 ` Peter Zijlstra
2026-02-19 15:07 ` Chen, Yu C
2026-02-19 18:11 ` Tim Chen
2026-02-20 3:29 ` Qais Yousef
2026-02-20 9:43 ` Peter Zijlstra
2026-02-24 2:49 ` Qais Yousef
2026-02-20 18:14 ` Tim Chen
2026-02-24 3:02 ` Qais Yousef
2026-02-20 3:25 ` Qais Yousef
2026-02-21 2:48 ` Chen, Yu C
2026-02-24 3:11 ` Qais Yousef
2026-02-19 19:48 ` Qais Yousef
2026-02-19 21:47 ` Tim Chen
2026-02-20 3:41 ` Qais Yousef
2026-02-20 8:45 ` Peter Zijlstra
2026-02-24 3:31 ` Qais Yousef