public inbox for linux-kernel@vger.kernel.org
* [Patch v4 00/22] Cache aware scheduling
@ 2026-04-01 21:52 Tim Chen
  2026-04-01 21:52 ` [Patch v4 01/22] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: Tim Chen @ 2026-04-01 21:52 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
	Gavin Guo, Qais Yousef, Libo Chen, linux-kernel

This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data within
the same Last Level Cache (LLC) domain. By improving cache locality,
the scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency. The design builds on the initial
prototype from Peter [1].

This initial implementation treats threads within the same process
as entities that are likely to share data. During load balancing, the
scheduler attempts to aggregate such threads onto the same LLC domain
whenever possible.

Most of the feedback received on v3 has been addressed. Some aspects
could be enhanced later after the basic cache-aware portion has landed:

There were discussions around grouping tasks using mechanisms other
than process membership. While we agree that more flexible grouping
is desirable, this series intentionally focuses on establishing basic
process-based grouping first, with alternative grouping mechanisms to
be explored in a follow-on series.

There was also discussion in v3 suggesting that the task wakeup path
should perform cache-aware scheduling. According to previous test
results, aggregating tasks in the wakeup path introduced task migration
bouncing, primarily because the wakeup path lacks up-to-date LLC load
information. That led to over-aggregation that later had to be
corrected by load balancing. The load balancing path was therefore
chosen as the conservative place to perform task aggregation; the task
wakeup path will be investigated as a future enhancement.

Furthermore, there were also requests to make cache-aware scheduling
benefit systems with small LLCs. Peter suggested using an llc-mask
rather than a single LLC value to express preferences[2]. This could
also be implemented as a future enhancement.

The cache-aware load balancing logic remains largely unchanged. The
significant changes in v4 are:

1. LLC ID management: LLC IDs are now allocated from a bitmap rather
than maintained as a static value.
2. Introduce a new patch [2/22] to limit the CPU scan span to the
preferred NUMA node when NUMA balancing is enabled.
3. Tweak the load balance failure accounting: keeping a load imbalance
at low load, or not pulling a task from its preferred LLC, is no longer
counted as a balance failure.

Other changes are described in each patch.

Test results:
The patch series was applied and tested on v7.0-rc3.
Git tree can be found here:
https://github.com/timcchen1298/linux/tree/cache_aware_v4

The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.

The second test platform is an AMD Genoa. There are 4 NUMA nodes and
32 CPUs per node. Each node has 2 CCXs, and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched
on these two platforms.

[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
active threads is below the capacity of an LLC.
schbench shows limited wakeup latency improvement.
ChaCha20-xiangshan (RISC-V simulator) shows good throughput
improvement. No obvious difference was observed in
netperf/stream/stress-ng in Hmean.

Genoa:
Significant improvement is observed in hackbench when
the number of active threads is lower than the number
of CPUs within one LLC. On v2, Aaron reported improvements
in hackbench/redis when the system is underloaded.
ChaCha20-xiangshan shows huge throughput improvement.
Phoronix has tested v1 and shows good improvements in 30+
cases[3]. No obvious difference was observed in
netperf/stream/stress-ng in Hmean.

Detail:
To conserve space, data that shows little difference from the
baseline is not presented.

Sapphire Rapids:
[hackbench pipe]
================
case                    load            baseline(std%)  compare%( std%)
threads-pipe-10         1-groups         1.00 (  1.22)  +26.09 (  1.10)
threads-pipe-10         2-groups         1.00 (  4.90)  +22.88 (  0.18)
threads-pipe-10         4-groups         1.00 (  2.07)   +9.00 (  3.49)
threads-pipe-10         8-groups         1.00 (  8.13)   +3.45 (  3.62)
threads-pipe-16         1-groups         1.00 (  2.11)  +26.30 (  0.08)
threads-pipe-16         2-groups         1.00 ( 15.13)   -1.77 ( 11.89)
threads-pipe-16         4-groups         1.00 (  4.37)   +0.58 (  7.99)
threads-pipe-16         8-groups         1.00 (  2.88)   +2.71 (  3.50)
threads-pipe-2          1-groups         1.00 (  9.40)  +22.07 (  0.71)
threads-pipe-2          2-groups         1.00 (  9.99)  +18.01 (  0.95)
threads-pipe-2          4-groups         1.00 (  3.98)  +24.66 (  0.96)
threads-pipe-2          8-groups         1.00 (  7.00)  +21.83 (  0.23)
threads-pipe-20         1-groups         1.00 (  1.03)  +28.84 (  0.21)
threads-pipe-20         2-groups         1.00 (  4.42)  +31.90 (  3.15)
threads-pipe-20         4-groups         1.00 (  9.97)   +4.56 (  1.69)
threads-pipe-20         8-groups         1.00 (  1.87)   +1.25 (  0.74)
threads-pipe-4          1-groups         1.00 (  4.48)  +25.67 (  0.78)
threads-pipe-4          2-groups         1.00 (  9.14)   +4.91 (  2.08)
threads-pipe-4          4-groups         1.00 (  7.68)  +19.36 (  1.53)
threads-pipe-4          8-groups         1.00 ( 10.79)   +7.20 ( 12.20)
threads-pipe-8          1-groups         1.00 (  4.69)  +21.93 (  0.03)
threads-pipe-8          2-groups         1.00 (  1.16)  +25.29 (  0.65)
threads-pipe-8          4-groups         1.00 (  2.23)   -1.27 (  3.62)
threads-pipe-8          8-groups         1.00 (  4.65)   -3.08 (  2.75)

Note: The default number of fds in hackbench is changed from 20 to
various values to ensure that threads fit within a single LLC,
especially on AMD systems. Take "threads-pipe-8, 2-groups" for example:
the number of fds is 8, and 2 groups are created.

[schbench]
The 99th percentile wakeup latency shows some improvement when the
system is underloaded, while the difference diminishes as system
utilization increases.

99th Wakeup Latencies	Base (mean std)      Compare (mean std)   Change
--------------------------------------------------------------------------------
thread=2                 9.00(0.00)           9.00(1.73)           0.00%
thread=4                 7.33(0.58)           6.33(0.58)           +13.64%
thread=8                 9.00(0.00)           7.67(1.15)           +14.78%   
thread=16                8.67(0.58)           8.67(1.53)           0.00%     
thread=32                9.00(0.00)           7.00(0.00)           +22.22%   
thread=64                9.33(0.58)           9.67(0.58)           -3.64%    
thread=128              12.00(0.00)          12.00(0.00)           0.00%

[chacha20]
baseline:
Host time spent: 67861ms
cache aware scheduling enabled:
Host time spent: 54441ms

Time reduced by ~20%

Genoa:
[hackbench pipe]
The default number of fds is 20, which exceeds the number of CPUs
in an LLC, so the number of fds is adjusted to 2, 4, 6, 8, and 20
respectively. Excluding results with large run-to-run variance,
10% ~ 50% improvement is observed when the system is underloaded:

[hackbench pipe]
================
case                    load            baseline(std%)  compare%( std%)
threads-pipe-2          1-groups         1.00 (  2.89)  +47.33 (  1.20)
threads-pipe-2          2-groups         1.00 (  3.88)  +39.82 (  0.61)
threads-pipe-2          4-groups         1.00 (  8.76)   +5.57 ( 13.10)
threads-pipe-20         1-groups         1.00 (  4.61)  +11.72 (  1.06)
threads-pipe-20         2-groups         1.00 (  6.18)  +14.55 (  1.47)
threads-pipe-20         4-groups         1.00 (  2.99)  +10.16 (  4.49)
threads-pipe-4          1-groups         1.00 (  4.23)  +43.70 (  2.14)
threads-pipe-4          2-groups         1.00 (  3.68)   +8.45 (  4.04)
threads-pipe-4          4-groups         1.00 ( 17.72)   +2.42 (  1.14)
threads-pipe-6          1-groups         1.00 (  3.10)   +7.74 (  3.83)
threads-pipe-6          2-groups         1.00 (  3.42)  +14.26 (  4.53)
threads-pipe-6          4-groups         1.00 ( 10.34)  +10.94 (  7.12)
threads-pipe-8          1-groups         1.00 (  4.21)   +9.06 (  4.43)
threads-pipe-8          2-groups         1.00 (  1.88)   +3.74 (  0.58)
threads-pipe-8          4-groups         1.00 (  2.78)  +23.96 (  1.18)


[chacha20]
baseline:
Host time spent: 54762ms
cache aware scheduling enabled:
Host time spent: 28295ms

Time reduced by 48%

[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://lore.kernel.org/all/20260219165221.GM1395266@noisy.programming.kicks-ass.net/
[3] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin

Change history:
**v4 Changes:**
1. Use a bitmap-based dynamic LLC ID allocation mechanism.
2. Introduce a new patch [2/22] to limit the CPU scan depth to the
   preferred NUMA node.
3. Keeping a load imbalance at low load, or not pulling a task from its
   preferred LLC, is no longer counted as a balance failure.
4. Other changes from v3 are detailed in each patch's change log.

**v3 Changes:**
v3 link: https://lore.kernel.org/all/cover.1770760558.git.tim.c.chen@linux.intel.com/
1. Cache-aware scheduling is skipped after repeated load balance
   failures (up to cache_nice_tries). This avoids repeatedly attempting
   cache-aware migrations when no movable tasks prefer the destination
   LLC.
2. The busiest runqueue is no longer sorted to select tasks that prefer
   the destination LLC. This sorting was costly, and equivalent
   behavior can be achieved by skipping tasks that do not prefer the
   destination LLC during cache-aware migrations.
3. Accounting of the number of tasks preferring each LLC is now kept in
   the lowest-level sched domain per CPU. This simplifies handling of
   LLC resizing and changes in the number of LLC domains.
4. Other changes from v2 are detailed in each patch's change log.

 
**v2 Changes:**
v2 link: https://lore.kernel.org/all/cover.1764801860.git.tim.c.chen@linux.intel.com/
1. Align NUMA balancing and cache affinity by
   prioritizing NUMA balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
   size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
   directly as array indices for LLC statistics.
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedback from the review of the v1 patch
   set (see individual patch change logs).

**v1**
v1 link: https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/

Chen Yu (10):
  sched/cache: Limit the scan number of CPUs when calculating task
    occupancy
  sched/cache: Record per LLC utilization to guide cache aware
    scheduling decisions
  sched/cache: Introduce helper functions to enforce LLC migration
    policy
  sched/cache: Disable cache aware scheduling for processes with high
    thread counts
  sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  sched/cache: Enable cache aware scheduling for multi LLCs NUMA node
  sched/cache: Allow the user space to turn on and off cache aware
    scheduling
  sched/cache: Add user control to adjust the aggressiveness of
    cache-aware scheduling
  -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy
    for each process via proc fs
  -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load
    balance statistics

Peter Zijlstra (Intel) (1):
  sched/cache: Introduce infrastructure for cache-aware load balancing

Tim Chen (11):
  sched/cache: Make LLC id continuous
  sched/cache: Assign preferred LLC ID to processes
  sched/cache: Track LLC-preferred tasks per runqueue
  sched/cache: Introduce per CPU's tasks LLC preference counter
  sched/cache: Calculate the percpu sd task LLC preference
  sched/cache: Count tasks prefering destination LLC in a sched group
  sched/cache: Check local_group only once in update_sg_lb_stats()
  sched/cache: Prioritize tasks preferring destination LLC during
    balancing
  sched/cache: Add migrate_llc_task migration type for cache-aware
    balancing
  sched/cache: Handle moving single tasks to/from their preferred LLC
  sched/cache: Respect LLC preference in task migration and detach

 fs/proc/base.c                 |   31 +
 include/linux/cacheinfo.h      |   21 +-
 include/linux/mm_types.h       |   43 ++
 include/linux/sched.h          |   32 +
 include/linux/sched/topology.h |   17 +
 include/trace/events/sched.h   |  140 ++++
 init/Kconfig                   |   11 +
 init/init_task.c               |    3 +
 kernel/fork.c                  |    6 +
 kernel/sched/core.c            |   13 +
 kernel/sched/debug.c           |   58 +-
 kernel/sched/fair.c            | 1180 +++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   50 ++
 kernel/sched/topology.c        |  234 ++++++-
 14 files changed, 1810 insertions(+), 29 deletions(-)

-- 
2.32.0

