linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/19] Cache Aware Scheduling
@ 2025-10-11 18:24 Tim Chen
  2025-10-11 18:24 ` [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing Tim Chen
                   ` (19 more replies)
  0 siblings, 20 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

There have been 4 RFC postings of this patch set. We've incorporated
the feedback and comments and would now like to post this patch set
for consideration for inclusion into mainline. The patches are based on
the original patch proposed by Peter [1].

The goal of the patch series is to aggregate tasks sharing data
onto the same LLC cache domain, thereby reducing cache bouncing and
cache misses and improving data access efficiency. In the current
implementation, threads within the same process are considered
entities that potentially share resources.
 
The changes from the v4 RFC patches are minor. Most are commit log and
code cleanups based on feedback. Several bugs were fixed:
1. A memory leak: the cache aware scheduling structure was not freed when struct mm is freed.
2. A false sharing regression involving nr_running_avg.
3. A bug in initializing cache aware scheduling structures on systems with no L3.

Peter suggested enhancing the patch set to allow task aggregation into
secondary LLCs when the preferred LLC becomes overloaded. We have not
implemented that in this version. In our previous testing, maintaining
stable LLC preferences proved important to avoid excessive task
migrations, which can undermine cache locality benefits. Additionally,
migrating tasks between primary and secondary LLCs often caused cache
bouncing, making the locality gains from using a secondary LLC marginal.
We would have to take a closer look to see whether such a scheme can
be done without these problems.

The following tunables under /sys/kernel/debug/sched/ control
the behavior of cache aware scheduling:

1. llc_aggr_tolerance
Controls how aggressively tasks are aggregated to their preferred LLC,
based on a process's RSS size and number of running threads. Processes
with a smaller memory footprint and fewer running threads benefit more
from aggregation. Varies between 0 and 100:
	0:   Cache aware scheduling is disabled.
	1:   Processes with RSS greater than the LLC size, or with more
	     running threads than the number of CPU cores per LLC, skip
	     aggregation.
	100: Aggressive; a process's threads are aggregated regardless of
	     RSS or running threads.
For example, with a 32MB L3 cache and 8 cores per L3:
	llc_aggr_tolerance=1  -> processes with RSS > 32MB or
				 nr_running_avg > 8 are skipped.
	llc_aggr_tolerance=99 -> processes with RSS > 784GB or
				 nr_running_avg > 785 are skipped.
	784GB = (1 + (99 - 1) * 256) * 32MB
	785   = 1 + (99 - 1) * 8
A sketch of this threshold arithmetic is shown after the tunable list
below.

Currently this knob is a global control. Considering that different
workloads have different requirements for task consolidation, it would
be ideal to introduce a per-process control for this knob via prctl()
in the future.
 
2. llc_overload_pct, llc_imb_pct
We'll always try to move a task to its preferred LLC if that LLC's
average core utilization is below llc_overload_pct (default: 50%).
Otherwise, the utilization of the preferred LLC must not exceed that of
the source LLC by more than llc_imb_pct (default: 20%) for the task to
be moved there. This prevents overloading the preferred LLC.

3. llc_epoch_period
Controls how often the scheduler collects the LLC occupancy of a
process (default: 10 msec).

4. llc_epoch_affinity_timeout
If a process has not run for llc_epoch_affinity_timeout (default: 50
msec), it loses its cache preference.
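
For illustration, below is a minimal user-space sketch of the
llc_aggr_tolerance threshold arithmetic quoted above. The helper and
the scaling factors are inferred only from the example numbers in this
cover letter and are assumptions; the actual in-kernel calculation may
differ:

	#include <stdio.h>

	/* hypothetical helper: compute the skip thresholds for a given tolerance */
	static void aggr_thresholds(unsigned int tolerance,
				    unsigned long long llc_size_mb,
				    unsigned int cores_per_llc)
	{
		/* RSS threshold: (1 + (tolerance - 1) * 256) * LLC size */
		unsigned long long rss_mb =
			(1ULL + (tolerance - 1) * 256ULL) * llc_size_mb;
		/* running-thread threshold: 1 + (tolerance - 1) * cores per LLC */
		unsigned long long threads =
			1ULL + (tolerance - 1) * (unsigned long long)cores_per_llc;

		printf("tolerance=%u: skip if RSS > %llu MB (~%llu GB) or nr_running_avg > %llu\n",
		       tolerance, rss_mb, rss_mb / 1024, threads);
	}

	int main(void)
	{
		/* 32MB L3 with 8 cores per LLC, as in the example above */
		aggr_thresholds(99, 32, 8);	/* -> ~784 GB and 785 threads */
		return 0;
	}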

Test results:
The first test platform is a 2-socket Intel Sapphire Rapids with 30
cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last level caches. There are 60
CPUs associated with each last level cache.

The second test platform is an AMD Milan. There are 2 Nodes and 64 CPUs
per node. Each node has 8 CCXs and each CCX has 8 CPUs.

The third test platform is an AMD Genoa. There are 4 Nodes and 32 CPUs per node.
Each node has 2 CCXs and each CCX has 16 CPUs.

[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when there is 1 group
with different numbers of fd pairs (threads) within this process.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan shows ~10% throughput improvement. Other
micro-workloads did not show much difference.

Milan:
No obvious difference is observed so far.

Genoa:
ChaCha20-xiangshan shows 44% throughput improvement.

[Sapphire Rapids details]

[hackbench]
Hackbench shows overall improvement when there is only 1
group, with different numbers of fd pairs. This is the
expected behavior because this test scenario benefits the
most from cache aware load balancing. Other group counts
show little difference (using the default fd = 20).

       groups              baseline            sched_cache
Min       1      37.5960 (   0.00%)     26.4340 (  29.69%)
Min       3      38.7050 (   0.00%)     38.6920 (   0.03%)
Min       5      39.4550 (   0.00%)     38.6280 (   2.10%)
Min       7      51.4270 (   0.00%)     50.6790 (   1.45%)
Min       12     62.8540 (   0.00%)     63.6590 (  -1.28%)
Min       16     74.0160 (   0.00%)     74.7480 (  -0.99%)
Amean     1      38.4768 (   0.00%)     26.7146 *  30.57%*
Amean     3      39.0750 (   0.00%)     39.5586 (  -1.24%)
Amean     5      41.5178 (   0.00%)     41.2766 (   0.58%)
Amean     7      52.1164 (   0.00%)     51.5152 (   1.15%)
Amean     12     63.9052 (   0.00%)     64.0420 (  -0.21%)
Amean     16     74.5812 (   0.00%)     75.4318 (  -1.14%)
BAmean-99 1      38.2027 (   0.00%)     26.5500 (  30.50%)
BAmean-99 3      38.8725 (   0.00%)     39.2225 (  -0.90%)
BAmean-99 5      41.1898 (   0.00%)     41.0037 (   0.45%)
BAmean-99 7      51.8645 (   0.00%)     51.4453 (   0.81%)
BAmean-99 12     63.6317 (   0.00%)     63.9307 (  -0.47%)
BAmean-99 16     74.4528 (   0.00%)     75.2113 (  -1.02%)

[schbench]
Wakeup Latencies 99.0th improvement is observed.

threads          baseline             sched_cache          change
1                13.80(1.10)          14.80(2.86)          -7.25%
2                12.00(1.00)          8.00(2.12)           +33.33%
4                9.00(0.00)           5.60(0.89)           +37.78%
8                9.00(0.00)           6.40(1.14)           +28.89%
16               9.20(0.45)           6.20(0.84)           +32.61%
32               9.60(0.55)           7.00(0.71)           +27.08%
64               10.80(0.45)          8.40(0.55)           +22.22%
128              12.60(0.55)          11.40(0.55)          +9.52%
239              14.00(0.00)          14.20(0.45)          -1.43%

[stream]
Not much difference is observed.
                             baseline              sched_cache
GB/sec copy-2        35.00 (   0.00%)       34.79 (  -0.60%)
GB/sec scale-2       24.04 (   0.00%)       23.90 (  -0.58%)
GB/sec add-2         28.98 (   0.00%)       28.92 (  -0.22%)
GB/sec triad-2       28.32 (   0.00%)       28.31 (  -0.04%)

[netperf]
Not much difference is observed (considering the stdev).

         nr_pairs          baseline            sched_cache

Hmean     60      1023.44 (   0.00%)     1021.87 (  -0.15%)
BHmean-99 60      1023.78 (   0.00%)     1022.22 (  -0.15%)
Hmean     120      792.09 (   0.00%)      793.75 (   0.21%)
BHmean-99 120      792.36 (   0.00%)      794.04 (   0.21%)
Hmean     180      513.42 (   0.00%)      513.53 (   0.02%)
BHmean-99 180      513.81 (   0.00%)      513.80 (  -0.00%)
Hmean     240      387.09 (   0.00%)      387.33 (   0.06%)
BHmean-99 240      387.18 (   0.00%)      387.45 (   0.07%)
Hmean     300      316.04 (   0.00%)      315.68 (  -0.12%)
BHmean-99 300      316.12 (   0.00%)      315.77 (  -0.11%)
Hmean     360      496.38 (   0.00%)      455.49 (  -8.24%)
BHmean-99 360      499.88 (   0.00%)      458.17 (  -8.34%)
Hmean     420      497.32 (   0.00%)      501.84 (   0.91%)
BHmean-99 420      499.90 (   0.00%)      504.56 (   0.93%)
Hmean     480      417.62 (   0.00%)      432.25 (   3.50%)
BHmean-99 480      419.96 (   0.00%)      434.43 (   3.45%)

In the above case of 360 pairs, although there is a performance
drop of 8.24%, the corresponding:
HCoeffVar   360    23.78 (   0.00%)       29.52 ( -24.15%)
shows that the regression is within the run-to-run variance.

[Milan details]

default settings:
[hackbench]

Min       1      50.8170 (   0.00%)     51.1890 (  -0.73%)
Min       3      59.3610 (   0.00%)     58.6080 (   1.27%)
Min       5      94.9760 (   0.00%)     96.0210 (  -1.10%)
Min       7     123.3270 (   0.00%)    124.1680 (  -0.68%)
Min       12    179.2000 (   0.00%)    181.8390 (  -1.47%)
Min       16    238.8680 (   0.00%)    242.6390 (  -1.58%)
Amean     1      51.6614 (   0.00%)     51.3630 (   0.58%)
Amean     3      60.1886 (   0.00%)     59.4542 (   1.22%)
Amean     5      95.7602 (   0.00%)     96.8338 (  -1.12%)
Amean     7     124.0332 (   0.00%)    124.4406 (  -0.33%)
Amean     12    181.0324 (   0.00%)    182.9220 (  -1.04%)
Amean     16    239.5556 (   0.00%)    243.3556 *  -1.59%*
BAmean-99 1      51.5335 (   0.00%)     51.3338 (   0.39%)
BAmean-99 3      59.7848 (   0.00%)     59.0958 (   1.15%)
BAmean-99 5      95.6698 (   0.00%)     96.5450 (  -0.91%)
BAmean-99 7     123.8478 (   0.00%)    124.3760 (  -0.43%)
BAmean-99 12    180.8035 (   0.00%)    182.5135 (  -0.95%)
BAmean-99 16    239.1933 (   0.00%)    243.0570 (  -1.62%)

[schbench]

threads          baseline             sched_cache          change
1                12.00(2.00)          11.00(0.71)          +8.33%
2                12.40(0.89)          13.80(0.84)          -11.29%
4                14.20(0.45)          14.80(0.45)          -4.23%
8                16.00(0.00)          15.80(0.45)          +1.25%
16               16.00(0.00)          16.00(0.71)          0.00%
32               19.40(0.55)          18.60(0.55)          +4.12%
63               22.20(0.45)          23.20(0.45)          -4.50%

[stream]
No obvious difference is found.
export STREAM_SIZE=$((128000000))

                     baseline               sched_cache
GB/sec copy-16       726.48 (   0.00%)      715.60 (  -1.50%)
GB/sec scale-16      577.71 (   0.00%)      577.03 (  -0.12%)
GB/sec add-16        678.85 (   0.00%)      672.87 (  -0.88%)
GB/sec triad-16      735.52 (   0.00%)      729.05 (  -0.88%)


[netperf]
Not much difference is observed.

         nr_pairs          baseline           sched_cache
Hmean     32       755.98 (   0.00%)      755.17 (  -0.11%)
BHmean-99 32       756.42 (   0.00%)      755.40 (  -0.13%)
Hmean     64       677.38 (   0.00%)      669.75 (  -1.13%)
BHmean-99 64       677.50 (   0.00%)      669.86 (  -1.13%)
Hmean     96       498.52 (   0.00%)      496.73 (  -0.36%)
BHmean-99 96       498.69 (   0.00%)      496.93 (  -0.35%)
Hmean     128      604.38 (   0.00%)      604.22 (  -0.03%)
BHmean-99 128      604.87 (   0.00%)      604.87 (   0.00%)
Hmean     160      471.67 (   0.00%)      468.29 (  -0.72%)
BHmean-99 160      474.34 (   0.00%)      471.05 (  -0.69%)
Hmean     192      381.18 (   0.00%)      384.88 (   0.97%)
BHmean-99 192      383.30 (   0.00%)      386.82 (   0.92%)
Hmean     224      327.79 (   0.00%)      326.05 (  -0.53%)
BHmean-99 224      329.85 (   0.00%)      327.87 (  -0.60%)
Hmean     256      284.61 (   0.00%)      300.52 (   5.59%)
BHmean-99 256      286.41 (   0.00%)      302.06 (   5.47%)

[Genoa details]
[ChaCha20-xiangshan]
ChaCha20-xiangshan is a simple benchmark using a static build of an
8-thread Verilator simulation of XiangShan (RISC-V). The README file
can be found here [2]. The score depends on how aggressively the user
sets /sys/kernel/debug/sched/llc_aggr_tolerance. Using the default
values, not much difference is observed. When setting
/sys/kernel/debug/sched/llc_aggr_tolerance to 100, a 44% improvement is
observed.
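(The knob is a debugfs file, so it can be adjusted at run time, e.g.
"echo 100 > /sys/kernel/debug/sched/llc_aggr_tolerance".)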

baseline:
Host time spent: 50,868ms

sched_cache:
Host time spent: 28,349ms

The time has been reduced by 44%.

Thanks to everyone who participated and provided valuable suggestions
on the previous versions. Comments and test results on this version
are also greatly appreciated.

Tim

[1] https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/

[2] https://github.com/yu-chen-surf/chacha20-xiangshan/blob/master/README.eng.md

RFC v4:
[3] https://lore.kernel.org/all/cover.1754712565.git.tim.c.chen@linux.intel.com/

RFC v3
[4] https://lore.kernel.org/all/cover.1750268218.git.tim.c.chen@linux.intel.com/

RFC v2:
[5] https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/


Chen Yu (7):
  sched/fair: Record per-LLC utilization to guide cache-aware scheduling
    decisions
  sched/fair: Introduce helper functions to enforce LLC migration policy
  sched/fair: Introduce a static key to enable cache aware only for
    multi LLCs
  sched/fair: Exclude processes with many threads from cache-aware
    scheduling
  sched/fair: Disable cache aware scheduling for processes with high
    thread counts
  sched/fair: Avoid cache-aware scheduling for memory-heavy processes
  sched/fair: Add user control to adjust the tolerance of cache-aware
    scheduling

Peter Zijlstra (Intel) (1):
  sched/fair: Add infrastructure for cache-aware load balancing

Tim Chen (11):
  sched/fair: Add LLC index mapping for CPUs
  sched/fair: Assign preferred LLC ID to processes
  sched/fair: Track LLC-preferred tasks per runqueue
  sched/fair: Introduce per runqueue task LLC preference counter
  sched/fair: Count tasks prefering each LLC in a sched group
  sched/fair: Prioritize tasks preferring destination LLC during
    balancing
  sched/fair: Identify busiest sched_group for LLC-aware load balancing
  sched/fair: Add migrate_llc_task migration type for cache-aware
    balancing
  sched/fair: Handle moving single tasks to/from their preferred LLC
  sched/fair: Consider LLC preference when selecting tasks for load
    balancing
  sched/fair: Respect LLC preference in task migration and detach

 include/linux/cacheinfo.h      |   21 +-
 include/linux/mm_types.h       |   45 ++
 include/linux/sched.h          |    5 +
 include/linux/sched/topology.h |    4 +
 include/linux/threads.h        |   10 +
 init/Kconfig                   |   20 +
 init/init_task.c               |    3 +
 kernel/fork.c                  |    6 +
 kernel/sched/core.c            |   18 +
 kernel/sched/debug.c           |   56 ++
 kernel/sched/fair.c            | 1022 +++++++++++++++++++++++++++++++-
 kernel/sched/features.h        |    1 +
 kernel/sched/sched.h           |   27 +
 kernel/sched/topology.c        |   61 +-
 14 files changed, 1283 insertions(+), 16 deletions(-)

-- 
2.32.0


^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-14 19:12   ` Madadi Vineeth Reddy
                     ` (2 more replies)
  2025-10-11 18:24 ` [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions Tim Chen
                   ` (18 subsequent siblings)
  19 siblings, 3 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

From: "Peter Zijlstra (Intel)" <peterz@infradead.org>

Cache-aware load balancing aims to aggregate tasks with potential
shared resources into the same cache domain. This approach enhances
cache locality, thereby optimizing system performance by reducing
cache misses and improving data access efficiency.

In the current implementation, threads within the same process are
considered as entities that potentially share resources.
Cache-aware load balancing monitors the CPU occupancy of each cache
domain for every process. Based on this monitoring, it endeavors to
migrate threads within a given process to its cache-hot domains,
with the goal of maximizing cache locality.

It is an attempt at modelling cache affinity. While the patch series
only targets LLC, it could very well be extended to clusters (L2),
or other kind of domains grouping inside a node.

As it stands, the mechanism only computes a CPU within the LLC that
has the highest recent runtime; this CPU is then used in the load
balance path in subsequent patches to steer toward this LLC.

More elaborate measures could be added later in NUMA_BALANCING: for
example, migrating task A to its preferred LLC when it has spare CPU
capacity, or swapping task A with another running task B in task A’s
preferred LLC.

Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/mm_types.h |  44 ++++++
 include/linux/sched.h    |   4 +
 init/Kconfig             |  11 ++
 kernel/fork.c            |   6 +
 kernel/sched/core.c      |   6 +
 kernel/sched/fair.c      | 288 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h  |   1 +
 kernel/sched/sched.h     |   8 ++
 8 files changed, 368 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 08bc2442db93..3ca557c2f36d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -927,6 +927,11 @@ struct mm_cid {
 };
 #endif
 
+struct mm_sched {
+	u64 runtime;
+	unsigned long epoch;
+};
+
 struct kioctx_table;
 struct iommu_mm_data;
 struct mm_struct {
@@ -1017,6 +1022,17 @@ struct mm_struct {
 		 */
 		raw_spinlock_t cpus_allowed_lock;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		/*
+		 * Track per-cpu-per-process occupancy as a proxy for cache residency.
+		 * See account_mm_sched() and ...
+		 */
+		struct mm_sched __percpu *pcpu_sched;
+		raw_spinlock_t mm_sched_lock;
+		unsigned long mm_sched_epoch;
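+		/* CPU with max occupancy in the preferred LLC; -1 means no preference */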
+		int mm_sched_cpu;
+#endif
+
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
 #endif
@@ -1436,6 +1452,34 @@ static inline unsigned int mm_cid_size(void)
 static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
 #endif /* CONFIG_SCHED_MM_CID */
 
+#ifdef CONFIG_SCHED_CACHE
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
+
+static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
+{
+	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+
+	if (!pcpu_sched)
+		return -ENOMEM;
+
+	mm_init_sched(mm, pcpu_sched);
+	return 0;
+}
+
+#define mm_alloc_sched(...)	alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
+
+static inline void mm_destroy_sched(struct mm_struct *mm)
+{
+	free_percpu(mm->pcpu_sched);
+	mm->pcpu_sched = NULL;
+}
+#else /* !CONFIG_SCHED_CACHE */
+
+static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_sched(struct mm_struct *mm) { }
+
+#endif /* CONFIG_SCHED_CACHE */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f8188b833350..d7ddb7ce6c4b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,10 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_CACHE
+	struct callback_head		cache_work;
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_len;
diff --git a/init/Kconfig b/init/Kconfig
index e3eb63eadc87..4e625db7920a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -970,6 +970,17 @@ config NUMA_BALANCING
 
 	  This system will be inactive on UMA systems.
 
+config SCHED_CACHE
+	bool "Cache aware load balance"
+	default y
+	depends on SMP
+	help
+	  When enabled, the scheduler will attempt to aggregate tasks from
+	  the same process onto a single Last Level Cache (LLC) domain when
+	  possible. This improves cache locality by keeping tasks that share
+	  resources within the same cache domain, reducing cache misses and
+	  lowering data access latency.
+
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
diff --git a/kernel/fork.c b/kernel/fork.c
index c4ada32598bd..9cd6efe2926d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -680,6 +680,7 @@ void __mmdrop(struct mm_struct *mm)
 	cleanup_lazy_tlbs(mm);
 
 	WARN_ON_ONCE(mm == current->active_mm);
+	mm_destroy_sched(mm);
 	mm_free_pgd(mm);
 	mm_free_id(mm);
 	destroy_context(mm);
@@ -1079,6 +1080,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
+	if (mm_alloc_sched(mm))
+		goto fail_sched;
+
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
 				     NR_MM_COUNTERS))
 		goto fail_pcpu;
@@ -1088,6 +1092,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	return mm;
 
 fail_pcpu:
+	mm_destroy_sched(mm);
+fail_sched:
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4..79d15e904d12 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4520,6 +4520,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 	init_sched_mm_cid(p);
+	init_sched_mm(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -8821,6 +8822,11 @@ void __init sched_init(void)
 
 		rq->core_cookie = 0UL;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		raw_spin_lock_init(&rq->cpu_epoch_lock);
+		rq->cpu_epoch_next = jiffies;
+#endif
+
 		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..a2ea002f4fd6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1152,6 +1152,8 @@ void post_init_entity_util_avg(struct task_struct *p)
 	sa->runnable_avg = sa->util_avg;
 }
 
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec);
+
 static s64 update_se(struct rq *rq, struct sched_entity *se)
 {
 	u64 now = rq_clock_task(rq);
@@ -1174,6 +1176,7 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 
 		trace_sched_stat_runtime(running, delta_exec);
 		account_group_exec_runtime(running, delta_exec);
+		account_mm_sched(rq, donor, delta_exec);
 
 		/* cgroup time is always accounted against the donor */
 		cgroup_account_cputime(donor, delta_exec);
@@ -1193,6 +1196,289 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 	return delta_exec;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+
+/*
+ * XXX numbers come from a place the sun don't shine -- probably wants to be SD
+ * tunable or so.
+ */
+#define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
+#define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */
+
+static int llc_id(int cpu)
+{
+	if (cpu < 0)
+		return -1;
+
+	return per_cpu(sd_llc_id, cpu);
+}
+
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
+{
+	unsigned long epoch;
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct rq *rq = cpu_rq(i);
+
+		pcpu_sched->runtime = 0;
+		pcpu_sched->epoch = rq->cpu_epoch;
+		epoch = rq->cpu_epoch;
+	}
+
+	raw_spin_lock_init(&mm->mm_sched_lock);
+	mm->mm_sched_epoch = epoch;
+	mm->mm_sched_cpu = -1;
+
+	/*
+	 * The update to mm->pcpu_sched should not be reordered
+	 * before initialization to mm's other fields, in case
+	 * the readers may get invalid mm_sched_epoch, etc.
+	 */
+	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
+}
+
+/* because why would C be fully specified */
+static __always_inline void __shr_u64(u64 *val, unsigned int n)
+{
+	if (n >= 64) {
+		*val = 0;
+		return;
+	}
+	*val >>= n;
+}
+
+static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	lockdep_assert_held(&rq->cpu_epoch_lock);
+
+	unsigned long n, now = jiffies;
+	long delta = now - rq->cpu_epoch_next;
+
+	if (delta > 0) {
+		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		rq->cpu_epoch += n;
+		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		__shr_u64(&rq->cpu_runtime, n);
+	}
+
+	n = rq->cpu_epoch - pcpu_sched->epoch;
+	if (n) {
+		pcpu_sched->epoch += n;
+		__shr_u64(&pcpu_sched->runtime, n);
+	}
+}
+
+static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+
+	__update_mm_sched(rq, pcpu_sched);
+
+	/*
+	 * Runtime is a geometric series (r=0.5) and as such will sum to twice
+	 * the accumulation period, this means the multiplication here should
+	 * not overflow.
+	 */
+	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
+}
+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+	struct mm_struct *mm = p->mm;
+	struct mm_sched *pcpu_sched;
+	unsigned long epoch;
+
+	if (!sched_feat(SCHED_CACHE))
+		return;
+
+	if (p->sched_class != &fair_sched_class)
+		return;
+	/*
+	 * init_task and kthreads don't have an mm
+	 */
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
+
+	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+		__update_mm_sched(rq, pcpu_sched);
+		pcpu_sched->runtime += delta_exec;
+		rq->cpu_runtime += delta_exec;
+		epoch = rq->cpu_epoch;
+	}
+
+	/*
+	 * If this task hasn't hit task_cache_work() for a while, or it
+	 * has only 1 thread, invalidate its preferred state.
+	 */
+	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	    get_nr_threads(p) <= 1) {
+		if (mm->mm_sched_cpu != -1)
+			mm->mm_sched_cpu = -1;
+	}
+}
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	struct mm_struct *mm = p->mm;
+
+	if (!sched_feat(SCHED_CACHE))
+		return;
+
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	if (mm->mm_sched_epoch == rq->cpu_epoch)
+		return;
+
+	guard(raw_spinlock)(&mm->mm_sched_lock);
+
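+	/* work->next == work means no cache_work is currently pending */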
+	if (work->next == work) {
+		task_work_add(p, work, TWA_RESUME);
+		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
+	}
+}
+
+static void get_scan_cpumasks(cpumask_var_t cpus, int cache_cpu,
+			      int pref_nid, int curr_cpu)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	/* First honor the task's preferred node. */
+	if (pref_nid != NUMA_NO_NODE)
+		cpumask_or(cpus, cpus, cpumask_of_node(pref_nid));
+#endif
+
+	/* Next honor the task's cache CPU if it is not included. */
+	if (cache_cpu != -1 && !cpumask_test_cpu(cache_cpu, cpus))
+		cpumask_or(cpus, cpus,
+			   cpumask_of_node(cpu_to_node(cache_cpu)));
+
+	/*
+	 * Lastly make sure that the task's current running node is
+	 * considered.
+	 */
+	if (!cpumask_test_cpu(curr_cpu, cpus))
+		cpumask_or(cpus, cpus, cpumask_of_node(cpu_to_node(curr_cpu)));
+}
+
+static void __no_profile task_cache_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	unsigned long m_a_occ = 0;
+	unsigned long curr_m_a_occ = 0;
+	int cpu, m_a_cpu = -1, cache_cpu,
+	    pref_nid = NUMA_NO_NODE, curr_cpu;
+	cpumask_var_t cpus;
+
+	WARN_ON_ONCE(work != &p->cache_work);
+
+	work->next = work;
+
+	if (p->flags & PF_EXITING)
+		return;
+
+	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
+		return;
+
+	curr_cpu = task_cpu(p);
+	cache_cpu = mm->mm_sched_cpu;
+#ifdef CONFIG_NUMA_BALANCING
+	if (static_branch_likely(&sched_numa_balancing))
+		pref_nid = p->numa_preferred_nid;
+#endif
+
+	scoped_guard (cpus_read_lock) {
+		get_scan_cpumasks(cpus, cache_cpu,
+				  pref_nid, curr_cpu);
+
+		for_each_cpu(cpu, cpus) {
+			/* XXX sched_cluster_active */
+			struct sched_domain *sd = per_cpu(sd_llc, cpu);
+			unsigned long occ, m_occ = 0, a_occ = 0;
+			int m_cpu = -1, i;
+
+			if (!sd)
+				continue;
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				occ = fraction_mm_sched(cpu_rq(i),
+							per_cpu_ptr(mm->pcpu_sched, i));
+				a_occ += occ;
+				if (occ > m_occ) {
+					m_occ = occ;
+					m_cpu = i;
+				}
+			}
+
+			/*
+			 * Compare the accumulated occupancy of each LLC. The
+			 * reason for using accumulated occupancy rather than average
+			 * per CPU occupancy is that it works better in asymmetric LLC
+			 * scenarios.
+			 * For example, if there are 2 threads in a 4CPU LLC and 3
+			 * threads in an 8CPU LLC, it might be better to choose the one
+			 * with 3 threads. However, this would not be the case if the
+			 * occupancy is divided by the number of CPUs in an LLC (i.e.,
+			 * if average per CPU occupancy is used).
+			 * Besides, NUMA balancing fault statistics behave similarly:
+			 * the total number of faults per node is compared rather than
+			 * the average number of faults per CPU. This strategy is also
+			 * followed here.
+			 */
+			if (a_occ > m_a_occ) {
+				m_a_occ = a_occ;
+				m_a_cpu = m_cpu;
+			}
+
+			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
+				curr_m_a_occ = a_occ;
+
+			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+		}
+	}
+
+	if (m_a_occ > (2 * curr_m_a_occ)) {
+		/*
+		 * Avoid switching mm_sched_cpu too fast.
+		 * The reason to choose 2X is because:
+		 * 1. It is better to keep the preferred LLC stable,
+		 *    rather than changing it frequently and cause migrations
+		 * 2. 2X means the new preferred LLC has at least 1 more
+		 *    busy CPU than the old one(200% vs 100%, eg)
+		 * 3. 2X is chosen based on test results, as it delivers
+		 *    the optimal performance gain so far.
+		 */
+		mm->mm_sched_cpu = m_a_cpu;
+	}
+
+	free_cpumask_var(cpus);
+}
+
+void init_sched_mm(struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+
+	init_task_work(work, task_cache_work);
+	work->next = work;
+}
+
+#else
+
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
+				    s64 delta_exec) { }
+
+void init_sched_mm(struct task_struct *p) { }
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
+
+#endif
+
 /*
  * Used by other classes to account runtime.
  */
@@ -13031,6 +13317,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	task_tick_cache(rq, curr);
+
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..d2af7bfd36bf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -87,6 +87,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_UTIL, true)
 
+SCHED_FEAT(SCHED_CACHE, true)
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..2ded8d3d0ecc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1166,6 +1166,12 @@ struct rq {
 	u64			clock_pelt_idle_copy;
 	u64			clock_idle_copy;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	raw_spinlock_t		cpu_epoch_lock ____cacheline_aligned;
+	u64			cpu_runtime;
+	unsigned long		cpu_epoch;
+	unsigned long		cpu_epoch_next;
+#endif
 
 	atomic_t		nr_iowait;
 
@@ -3790,6 +3796,8 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
 static inline void init_sched_mm_cid(struct task_struct *t) { }
 #endif /* !CONFIG_SCHED_MM_CID */
 
+extern void init_sched_mm(struct task_struct *p);
+
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 static inline
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
  2025-10-11 18:24 ` [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-15 10:15   ` Peter Zijlstra
  2025-10-27  5:01   ` K Prateek Nayak
  2025-10-11 18:24 ` [PATCH 03/19] sched/fair: Introduce helper functions to enforce LLC migration policy Tim Chen
                   ` (17 subsequent siblings)
  19 siblings, 2 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen,
	Aubrey Li, Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

When a system becomes busy and a process’s preferred LLC is
saturated with too many threads, tasks within that LLC migrate
frequently. Such migrations introduce latency and degrade
performance. To avoid this, task aggregation should be suppressed when
the preferred LLC is overloaded, which requires a metric to indicate
LLC utilization.

Record per-LLC utilization and CPU capacity during periodic load
balancing. These statistics will be used in later patches to decide
whether tasks should be aggregated into their preferred LLC.
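
As an illustration only (llc_is_busy() below is hypothetical and not
part of this series; the real consumers of these statistics are added
by later patches), a reader of these statistics in a load-balancing
path could look like:

	static bool llc_is_busy(int cpu)
	{
		unsigned long util, cap;

		/* caller holds rcu_read_lock(), as in the load-balancing paths */
		if (!get_llc_stats(cpu, &util, &cap))
			return false;

		/* e.g. treat the LLC as busy above ~50% utilization */
		return util * 100 >= cap * 50;
	}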

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/sched/topology.h |  4 ++
 kernel/sched/fair.c            | 73 ++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 5263746b63e8..fa25db00fdb6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -77,6 +77,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned long	util_avg;
+	unsigned long	capacity ____cacheline_aligned_in_smp;
+#endif
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2ea002f4fd6..1ebb0d99a906 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9559,6 +9559,29 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/* Called from load balancing paths with rcu_read_lock held */
+static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
+					 unsigned long *cap)
+{
+	struct sched_domain_shared *sd_share;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (!sd_share)
+		return false;
+
+	*util = READ_ONCE(sd_share->util_avg);
+	*cap = READ_ONCE(sd_share->capacity);
+
+	return true;
+}
+#else
+static inline bool get_llc_stats(int cpu, unsigned long *util,
+				 unsigned long *cap)
+{
+	return false;
+}
+#endif
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -10529,6 +10552,55 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Record the statistics for this scheduler group for later
+ * use. These values guide load balancing on aggregating tasks
+ * to a LLC.
+ */
+static void record_sg_llc_stats(struct lb_env *env,
+				struct sg_lb_stats *sgs,
+				struct sched_group *group)
+{
+	/*
+	 * Find the child domain on env->dst_cpu. This domain
+	 * is either the domain that spans this group(if the
+	 * group is a local group), or the sibling domain of
+	 * this group.
+	 */
+	struct sched_domain *sd = env->sd->child;
+	struct sched_domain_shared *sd_share;
+
+	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/* only care about sched domains spanning a LLC */
+	if (sd != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
+		return;
+
+	/*
+	 * At this point we know this group spans a LLC domain.
+	 * Record the statistic of this group in its corresponding
+	 * shared LLC domain.
+	 */
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
+					   cpumask_first(sched_group_span(group))));
+	if (!sd_share)
+		return;
+
+	if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
+		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
+
+	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
+		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
+}
+#else
+static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
+				       struct sched_group *group)
+{
+}
+#endif
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10618,6 +10690,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
+	record_sg_llc_stats(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 03/19] sched/fair: Introduce helper functions to enforce LLC migration policy
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
  2025-10-11 18:24 ` [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing Tim Chen
  2025-10-11 18:24 ` [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-11 18:24 ` [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs Tim Chen
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen,
	Aubrey Li, Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

Cache-aware scheduling aggregates threads onto their preferred LLC,
mainly through load balancing. When the preferred LLC becomes
saturated, more threads are still placed there, increasing latency.
A mechanism is needed to limit aggregation so that the preferred LLC
does not become overloaded.

Introduce helper functions can_migrate_llc() and
can_migrate_llc_task() to enforce the LLC migration policy:

  1. Aggregate a task to its preferred LLC if both source and
     destination LLCs are not too busy (<50% utilization, tunable),
     or if doing so will not leave the preferred LLC much more
     imbalanced than the non-preferred one (>20% utilization
     difference, tunable, similar to imbalance_pct of the LLC domain).
  2. Allow moving a task from an overloaded preferred LLC to a
     non-preferred LLC if this will not leave the non-preferred LLC so
     loaded that it triggers a later migration back.
  3. If both LLCs are too busy, let generic load balancing spread
     the tasks.

This hysteresis prevents tasks from being migrated into and out of the
preferred LLC frequently (back and forth): the threshold for migrating
a task out of its preferred LLC is higher than that for migrating it
into the LLC.

Since aggregation tends to make the preferred LLC busier than others,
the imbalance tolerance is controlled by llc_imb_pct. If set to 0,
tasks may still aggregate to the preferred LLC as long as it is
not more utilized than the source LLC, preserving the preference.

Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/debug.c |   4 ++
 kernel/sched/fair.c  | 145 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   5 ++
 3 files changed, 154 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 02e16b70a790..57bb04ebbf96 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -523,6 +523,10 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_CACHE
+	debugfs_create_u32("llc_overload_pct", 0644, debugfs_sched, &llc_overload_pct);
+	debugfs_create_u32("llc_imb_pct", 0644, debugfs_sched, &llc_imb_pct);
+#endif
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ebb0d99a906..cd080468ddc9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1205,6 +1205,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 #define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
 #define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */
 
+__read_mostly unsigned int llc_overload_pct       = 50;
+__read_mostly unsigned int llc_imb_pct            = 20;
+
 static int llc_id(int cpu)
 {
 	if (cpu < 0)
@@ -9560,6 +9563,27 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
 }
 
 #ifdef CONFIG_SCHED_CACHE
+/*
+ * The margin used when comparing LLC utilization with CPU capacity.
+ * Parameter llc_overload_pct determines the LLC load level where
+ * active LLC aggregation is done.
+ * Derived from fits_capacity().
+ *
+ * (default: ~50%)
+ */
+#define fits_llc_capacity(util, max)	\
+	((util) * 100 < (max) * llc_overload_pct)
+
+/*
+ * The margin used when comparing utilization:
+ * is 'util1' noticeably greater than 'util2'?
+ * Derived from capacity_greater().
+ * Bias is in percent.
+ */
+/* Allows dst util to be bigger than src util by up to bias percent */
+#define util_greater(util1, util2) \
+	((util1) * 100 > (util2) * (100 + llc_imb_pct))
+
 /* Called from load balancing paths with rcu_read_lock held */
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 					 unsigned long *cap)
@@ -9575,6 +9599,127 @@ static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 
 	return true;
 }
+
+/*
+ * Decision matrix according to the LLC utilization. To
+ * decide whether we can do task aggregation across LLC.
+ *
+ * By default, 50% is the threshold to treat the LLC as busy,
+ * and 20% is the utilization imbalance percentage to decide
+ * if the preferred LLC is busier than the non-preferred LLC.
+ *
+ * 1. moving towards the preferred LLC, dst is the preferred
+ *    LLC, src is not.
+ *
+ * src \ dst      30%  40%  50%  60%
+ * 30%            Y    Y    Y    N
+ * 40%            Y    Y    Y    Y
+ * 50%            Y    Y    G    G
+ * 60%            Y    Y    G    G
+ *
+ * 2. moving out of the preferred LLC, src is the preferred
+ *    LLC, dst is not:
+ *
+ * src \ dst      30%  40%  50%  60%
+ * 30%            N    N    N    N
+ * 40%            N    N    N    N
+ * 50%            N    N    G    G
+ * 60%            Y    N    G    G
+ *
+ * src :      src_util
+ * dst :      dst_util
+ * Y :        Yes, migrate
+ * N :        No, do not migrate
+ * G :        let the Generic load balance to even the load.
+ *
+ * The intention is that if both LLCs are quite busy, cache aware
+ * load balance should not be performed, and generic load balance
+ * should take effect. However, if one is busy and the other is not,
+ * the preferred LLC capacity(50%) and imbalance criteria(20%) should
+ * be considered to determine whether LLC aggregation should be
+ * performed to bias the load towards the preferred LLC.
+ */
+
+/* migration decision, 3 states are orthogonal. */
+enum llc_mig {
+	mig_forbid = 0,		/* N: Don't migrate task, respect LLC preference */
+	mig_llc,		/* Y: Do LLC preference based migration */
+	mig_unrestricted	/* G: Don't restrict generic load balance migration */
+};
+
+static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
+				    unsigned long tsk_util,
+				    bool to_pref)
+{
+	unsigned long src_util, dst_util, src_cap, dst_cap;
+
+	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
+	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
+		return mig_unrestricted;
+
+	if (!fits_llc_capacity(dst_util, dst_cap) &&
+	    !fits_llc_capacity(src_util, src_cap))
+		return mig_unrestricted;
+
+	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
+	dst_util = dst_util + tsk_util;
+	if (to_pref) {
+		/*
+		 * llc_imb_pct is the imbalance allowed between
+		 * preferred LLC and non-preferred LLC.
+		 * Don't migrate if we will get preferred LLC too
+		 * heavily loaded and if the dest is much busier
+		 * than the src, in which case migration will
+		 * increase the imbalance too much.
+		 */
+		if (!fits_llc_capacity(dst_util, dst_cap) &&
+		    util_greater(dst_util, src_util))
+			return mig_forbid;
+	} else {
+		/*
+		 * Don't migrate if we will leave preferred LLC
+		 * too idle, or if this migration leads to the
+		 * non-preferred LLC falling within llc_imb_pct percent
+		 * of preferred LLC, leading to migration again
+		 * back to preferred LLC.
+		 */
+		if (fits_llc_capacity(src_util, src_cap) ||
+		    !util_greater(src_util, dst_util))
+			return mig_forbid;
+	}
+	return mig_llc;
+}
+
+/*
+ * Check if task p can migrate from src_cpu to dst_cpu
+ * in terms of cache aware load balance.
+ */
+static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+							struct task_struct *p)
+{
+	struct mm_struct *mm;
+	bool to_pref;
+	int cpu;
+
+	mm = p->mm;
+	if (!mm)
+		return mig_unrestricted;
+
+	cpu = mm->mm_sched_cpu;
+	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
+		return mig_unrestricted;
+
+	if (cpus_share_cache(dst_cpu, cpu))
+		to_pref = true;
+	else if (cpus_share_cache(src_cpu, cpu))
+		to_pref = false;
+	else
+		return mig_unrestricted;
+
+	return can_migrate_llc(src_cpu, dst_cpu,
+			       task_util(p), to_pref);
+}
+
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
 				 unsigned long *cap)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2ded8d3d0ecc..a52c96064b36 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2797,6 +2797,11 @@ extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 
+#ifdef CONFIG_SCHED_CACHE
+extern unsigned int llc_overload_pct;
+extern unsigned int llc_imb_pct;
+#endif
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (2 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 03/19] sched/fair: Introduce helper functions to enforce LLC migration policy Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-15 11:04   ` Peter Zijlstra
  2025-10-27  5:42   ` K Prateek Nayak
  2025-10-11 18:24 ` [PATCH 05/19] sched/fair: Add LLC index mapping for CPUs Tim Chen
                   ` (15 subsequent siblings)
  19 siblings, 2 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen,
	Aubrey Li, Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

Enable cache-aware load balancing only if at least 1 NUMA node has
more than one LLC.

Suggested-by: Libo Chen <libo.chen@oracle.com>
Suggested-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c     | 15 ++++++++++++---
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 14 ++++++++++++--
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cd080468ddc9..3d643449c48c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1208,6 +1208,14 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 __read_mostly unsigned int llc_overload_pct       = 50;
 __read_mostly unsigned int llc_imb_pct            = 20;
 
+DEFINE_STATIC_KEY_FALSE(sched_cache_allowed);
+
+static inline bool sched_cache_enabled(void)
+{
+	return sched_feat(SCHED_CACHE) &&
+		static_branch_likely(&sched_cache_allowed);
+}
+
 static int llc_id(int cpu)
 {
 	if (cpu < 0)
@@ -1294,7 +1302,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	struct mm_sched *pcpu_sched;
 	unsigned long epoch;
 
-	if (!sched_feat(SCHED_CACHE))
+	if (!sched_cache_enabled())
 		return;
 
 	if (p->sched_class != &fair_sched_class)
@@ -1330,7 +1338,7 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
 	struct callback_head *work = &p->cache_work;
 	struct mm_struct *mm = p->mm;
 
-	if (!sched_feat(SCHED_CACHE))
+	if (!sched_cache_enabled())
 		return;
 
 	if (!mm || !mm->pcpu_sched)
@@ -10716,7 +10724,8 @@ static void record_sg_llc_stats(struct lb_env *env,
 	struct sched_domain *sd = env->sd->child;
 	struct sched_domain_shared *sd_share;
 
-	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
+	if (!sched_cache_enabled() ||
+	    env->idle == CPU_NEWLY_IDLE)
 		return;
 
 	/* only care about sched domains spanning a LLC */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a52c96064b36..60f1e51685ec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2800,6 +2800,7 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
 #ifdef CONFIG_SCHED_CACHE
 extern unsigned int llc_overload_pct;
 extern unsigned int llc_imb_pct;
+extern struct static_key_false sched_cache_allowed;
 #endif
 
 #ifdef CONFIG_SCHED_HRTICK
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6e2f54169e66..2675db980f70 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2444,6 +2444,7 @@ static int
 build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
 {
 	enum s_alloc alloc_state = sa_none;
+	bool has_multi_llcs = false;
 	struct sched_domain *sd;
 	struct s_data d;
 	struct rq *rq = NULL;
@@ -2530,10 +2531,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 				 * between LLCs and memory channels.
 				 */
 				nr_llcs = sd->span_weight / child->span_weight;
-				if (nr_llcs == 1)
+				if (nr_llcs == 1) {
 					imb = sd->span_weight >> 3;
-				else
+				} else {
 					imb = nr_llcs;
+					has_multi_llcs = true;
+				}
 				imb = max(1U, imb);
 				sd->imb_numa_nr = imb;
 
@@ -2581,6 +2584,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	if (has_cluster)
 		static_branch_inc_cpuslocked(&sched_cluster_active);
 
+#ifdef CONFIG_SCHED_CACHE
+	if (has_multi_llcs) {
+		static_branch_enable_cpuslocked(&sched_cache_allowed);
+		pr_info("Cache aware load balance enabled.\n");
+	}
+#endif
+
 	if (rq && sched_debug_verbose)
 		pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map));
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 05/19] sched/fair: Add LLC index mapping for CPUs
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (3 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-15 11:08   ` Peter Zijlstra
  2025-10-15 11:58   ` Peter Zijlstra
  2025-10-11 18:24 ` [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes Tim Chen
                   ` (14 subsequent siblings)
  19 siblings, 2 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

Introduce an index mapping between CPUs and their LLCs. This provides
a contiguous per-LLC index needed for cache-aware load balancing in
later patches.

The existing per_cpu llc_id usually points to the first CPU of the
LLC domain, which is sparse and unsuitable as an array index. Using
llc_id directly would waste memory.

With the new mapping, CPUs in the same LLC share a contiguous index:

  per_cpu(llc_idx, CPU=0...15)  = 0
  per_cpu(llc_idx, CPU=16...31) = 1
  per_cpu(llc_idx, CPU=32...47) = 2
  ...

The maximum number of LLCs is limited by CONFIG_NR_LLCS. If the number
of LLCs available exceeds CONFIG_NR_LLCS, the cache aware load balance
is disabled. To further save memory, this array could be converted to
dynamic allocation in the future, or the LLC index could be made NUMA
node-wide.

As mentioned by Adam, if there is no domain with SD_SHARE_LLC, the
function update_llc_idx() should not be invoked to update the index;
otherwise, it will generate an invalid index.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/threads.h | 10 +++++++++
 init/Kconfig            |  9 ++++++++
 kernel/sched/fair.c     | 11 ++++++++++
 kernel/sched/sched.h    |  2 ++
 kernel/sched/topology.c | 47 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 79 insertions(+)

diff --git a/include/linux/threads.h b/include/linux/threads.h
index 1674a471b0b4..2c9b1adfe024 100644
--- a/include/linux/threads.h
+++ b/include/linux/threads.h
@@ -20,6 +20,16 @@
 /* Places which use this should consider cpumask_var_t. */
 #define NR_CPUS		CONFIG_NR_CPUS
 
+#ifndef CONFIG_NR_LLCS
+#define CONFIG_NR_LLCS 1
+#endif
+
+#if CONFIG_NR_LLCS > NR_CPUS
+#define NR_LLCS		NR_CPUS
+#else
+#define NR_LLCS		CONFIG_NR_LLCS
+#endif
+
 #define MIN_THREADS_LEFT_FOR_ROOT 4
 
 /*
diff --git a/init/Kconfig b/init/Kconfig
index 4e625db7920a..6e4c96ccdda0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -981,6 +981,15 @@ config SCHED_CACHE
 	  resources within the same cache domain, reducing cache misses and
 	  lowering data access latency.
 
+config NR_LLCS
+	int "Maximum number of Last Level Caches"
+	range 2 1024
+	depends on SMP && SCHED_CACHE
+	default 64
+	help
+	  This allows you to specify the maximum number of last level caches
+	  this kernel will support for cache aware scheduling.
+
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d643449c48c..61c129bde8b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1224,6 +1224,17 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+/*
+ * continuous LLC index, starting from 0.
+ */
+static inline int llc_idx(int cpu)
+{
+	if (cpu < 0)
+		return -1;
+
+	return per_cpu(sd_llc_idx, cpu);
+}
+
 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 {
 	unsigned long epoch;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 60f1e51685ec..b448ad6dc51d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2039,6 +2039,7 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(int, sd_llc_idx);
 DECLARE_PER_CPU(int, sd_share_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
@@ -2047,6 +2048,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 
 extern struct static_key_false sched_asym_cpucapacity;
 extern struct static_key_false sched_cluster_active;
+extern int max_llcs;
 
 static __always_inline bool sched_asym_cpucap_active(void)
 {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 2675db980f70..4bd033060f1d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -659,6 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_llc_idx);
 DEFINE_PER_CPU(int, sd_share_id);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
@@ -668,6 +669,40 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
 
+int max_llcs = -1;
+
+static void update_llc_idx(int cpu)
+{
+#ifdef CONFIG_SCHED_CACHE
+	int idx = -1, llc_id = -1;
+
+	if (max_llcs > NR_LLCS)
+		return;
+
+	llc_id = per_cpu(sd_llc_id, cpu);
+	idx = per_cpu(sd_llc_idx, llc_id);
+
+	/*
+	 * A new LLC is detected, increase the index
+	 * by 1.
+	 */
+	if (idx < 0) {
+		idx = max_llcs++;
+
+		if (max_llcs > NR_LLCS) {
+			if (static_branch_unlikely(&sched_cache_allowed))
+				static_branch_disable_cpuslocked(&sched_cache_allowed);
+
+			pr_warn_once("CONFIG_NR_LLCS is too small, disable cache aware load balance\n");
+			return;
+		}
+
+		per_cpu(sd_llc_idx, llc_id) = idx;
+	}
+	per_cpu(sd_llc_idx, cpu) = idx;
+#endif
+}
+
 static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain_shared *sds = NULL;
@@ -687,6 +722,10 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
+	/* only update the llc index for domain with SD_SHARE_LLC */
+	if (sd)
+		update_llc_idx(cpu);
+
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
 	if (sd)
 		id = cpumask_first(sched_domain_span(sd));
@@ -2452,6 +2491,14 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	bool has_asym = false;
 	bool has_cluster = false;
 
+#ifdef CONFIG_SCHED_CACHE
+	if (max_llcs < 0) {
+		for_each_possible_cpu(i)
+			per_cpu(sd_llc_idx, i) = -1;
+		max_llcs = 0;
+	}
+#endif
+
 	if (WARN_ON(cpumask_empty(cpu_map)))
 		goto error;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (4 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 05/19] sched/fair: Add LLC index mapping for CPUs Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-14  5:16   ` Chen, Yu C
  2025-10-11 18:24 ` [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue Tim Chen
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

With cache-aware scheduling enabled, each task is assigned a
preferred LLC ID. This allows quick identification of the LLC domain
where the task prefers to run, similar to numa_preferred_nid in
NUMA balancing.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/sched.h | 1 +
 init/init_task.c      | 3 +++
 kernel/sched/fair.c   | 7 +++++++
 3 files changed, 11 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d7ddb7ce6c4b..8a5e4038cd5c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1402,6 +1402,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CACHE
 	struct callback_head		cache_work;
+	int				preferred_llc;
 #endif
 
 #ifdef CONFIG_RSEQ
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..5fffbe766f57 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -188,6 +188,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.numa_group	= NULL,
 	.numa_faults	= NULL,
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	.preferred_llc  = -1,
+#endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	.kasan_depth	= 1,
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61c129bde8b6..d6167a029c47 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1312,6 +1312,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	struct mm_struct *mm = p->mm;
 	struct mm_sched *pcpu_sched;
 	unsigned long epoch;
+	int mm_sched_llc = -1;
 
 	if (!sched_cache_enabled())
 		return;
@@ -1342,6 +1343,12 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 	}
+
+	if (mm->mm_sched_cpu != -1)
+		mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);
+
+	if (p->preferred_llc != mm_sched_llc)
+		p->preferred_llc = mm_sched_llc;
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (5 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-15 12:05   ` Peter Zijlstra
  2025-10-27  6:04   ` K Prateek Nayak
  2025-10-11 18:24 ` [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter Tim Chen
                   ` (12 subsequent siblings)
  19 siblings, 2 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

For each runqueue, track the number of tasks with an LLC preference
and how many of them are running on their preferred LLC. This mirrors
nr_numa_running and nr_preferred_running for NUMA balancing, and will
be used by cache-aware load balancing in later patches.
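
Put differently, the two fields maintain a per-runqueue invariant over
the queued CFS tasks: nr_llc_running counts the tasks that have any LLC
preference, and nr_pref_llc_running counts the subset already running
in their preferred LLC. A condensed sketch of the accounting (the
actual patch splits this into account_llc_enqueue() and
account_llc_dequeue(); here delta is +1 on enqueue and -1 on dequeue):

static void account_llc(struct rq *rq, struct task_struct *p, int delta)
{
	rq->nr_llc_running      += delta * (p->preferred_llc != -1);
	rq->nr_pref_llc_running += delta * (p->preferred_llc == task_llc(p));
}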

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c  | 12 +++++++++++
 kernel/sched/fair.c  | 47 +++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  7 +++++++
 3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 79d15e904d12..5940756e2da3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -529,6 +529,18 @@ void __trace_set_current_state(int state_value)
 }
 EXPORT_SYMBOL(__trace_set_current_state);
 
+#ifdef CONFIG_SMP
+int task_llc(const struct task_struct *p)
+{
+	return per_cpu(sd_llc_id, task_cpu(p));
+}
+#else
+int task_llc(const struct task_struct *p)
+{
+	return 0;
+}
+#endif
+
 /*
  * Serialization rules:
  *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d6167a029c47..fd315937c0cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1235,6 +1235,24 @@ static inline int llc_idx(int cpu)
 	return per_cpu(sd_llc_idx, cpu);
 }
 
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
+{
+	if (!sched_cache_enabled())
+		return;
+
+	rq->nr_llc_running += (p->preferred_llc != -1);
+	rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));
+}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
+{
+	if (!sched_cache_enabled())
+		return;
+
+	rq->nr_llc_running -= (p->preferred_llc != -1);
+	rq->nr_pref_llc_running -= (p->preferred_llc == task_llc(p));
+}
+
 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 {
 	unsigned long epoch;
@@ -1306,6 +1324,8 @@ static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sch
 	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
 }
 
+static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
+
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
@@ -1347,8 +1367,13 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	if (mm->mm_sched_cpu != -1)
 		mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);
 
-	if (p->preferred_llc != mm_sched_llc)
+	/* task not on rq accounted later in account_entity_enqueue() */
+	if (task_running_on_cpu(rq->cpu, p) &&
+	    p->preferred_llc != mm_sched_llc) {
+		account_llc_dequeue(rq, p);
 		p->preferred_llc = mm_sched_llc;
+		account_llc_enqueue(rq, p);
+	}
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
@@ -1497,6 +1522,15 @@ void init_sched_mm(struct task_struct *p)
 	work->next = work;
 }
 
+void reset_llc_stats(struct rq *rq)
+{
+	if (!sched_cache_enabled())
+		return;
+
+	rq->nr_llc_running = 0;
+	rq->nr_pref_llc_running = 0;
+}
+
 #else
 
 static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
@@ -1506,6 +1540,11 @@ void init_sched_mm(struct task_struct *p) { }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
 
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p) {}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p) {}
+
+void reset_llc_stats(struct rq *rq) {}
 #endif
 
 /*
@@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		struct rq *rq = rq_of(cfs_rq);
 
 		account_numa_enqueue(rq, task_of(se));
+		account_llc_enqueue(rq, task_of(se));
 		list_add(&se->group_node, &rq->cfs_tasks);
 	}
 	cfs_rq->nr_queued++;
@@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
 		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
 	}
 	cfs_rq->nr_queued--;
+
+	/* safeguard to clear the cache aware data */
+	if (!parent_entity(se) && !cfs_rq->nr_queued)
+		reset_llc_stats(rq_of(cfs_rq));
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b448ad6dc51d..3ab64067acc6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1098,6 +1098,10 @@ struct rq {
 	unsigned int		nr_preferred_running;
 	unsigned int		numa_migrate_on;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int		nr_pref_llc_running;
+	unsigned int		nr_llc_running;
+#endif
 #ifdef CONFIG_NO_HZ_COMMON
 	unsigned long		last_blocked_load_update_tick;
 	unsigned int		has_blocked_load;
@@ -1952,6 +1956,9 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 
 #endif /* !CONFIG_NUMA_BALANCING */
 
+void reset_llc_stats(struct rq *rq);
+int task_llc(const struct task_struct *p);
+
 static inline void
 queue_balance_callback(struct rq *rq,
 		       struct balance_callback *head,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (6 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-15 12:21   ` Peter Zijlstra
  2025-10-11 18:24 ` [PATCH 09/19] sched/fair: Count tasks preferring each LLC in a sched group Tim Chen
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

Each runqueue is assigned a static array where each element tracks
the number of tasks preferring a given LLC, indexed from 0 to
NR_LLCS - 1.

For example, rq->nr_pref_llc[3] = 2 signifies that there are 2 tasks on
this runqueue which prefer to run within LLC3.

The load balancer can use this information to identify busy runqueues
and migrate tasks to their preferred LLC domains.
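
To illustrate how the array is meant to be consumed: given a
destination CPU, a balancer only needs that CPU's dense LLC index to
read the matching slot on each candidate runqueue. rq_nr_pref_dst()
below is a hypothetical helper, not something added by this patch:

/* How many queued tasks on @rq prefer @dst_cpu's LLC? */
static unsigned int rq_nr_pref_dst(struct rq *rq, int dst_cpu)
{
	int idx = llc_idx(dst_cpu);

	return idx < 0 ? 0 : rq->nr_pref_llc[idx];
}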

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c  | 35 +++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 36 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd315937c0cf..b7a68fe7601b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1235,22 +1235,51 @@ static inline int llc_idx(int cpu)
 	return per_cpu(sd_llc_idx, cpu);
 }
 
+static inline int pref_llc_idx(struct task_struct *p)
+{
+	return llc_idx(p->preferred_llc);
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
+	int pref_llc;
+
 	if (!sched_cache_enabled())
 		return;
 
 	rq->nr_llc_running += (p->preferred_llc != -1);
 	rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));
+
+	if (p->preferred_llc < 0)
+		return;
+
+	pref_llc = pref_llc_idx(p);
+	if (pref_llc < 0)
+		return;
+
+	++rq->nr_pref_llc[pref_llc];
 }
 
 static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 {
+	int pref_llc;
+
 	if (!sched_cache_enabled())
 		return;
 
 	rq->nr_llc_running -= (p->preferred_llc != -1);
 	rq->nr_pref_llc_running -= (p->preferred_llc == task_llc(p));
+
+	if (p->preferred_llc < 0)
+		return;
+
+	pref_llc = pref_llc_idx(p);
+	if (pref_llc < 0)
+		return;
+
+	/* avoid negative counter */
+	if (rq->nr_pref_llc[pref_llc] > 0)
+		--rq->nr_pref_llc[pref_llc];
 }
 
 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
@@ -1524,10 +1553,16 @@ void init_sched_mm(struct task_struct *p)
 
 void reset_llc_stats(struct rq *rq)
 {
+	int i = 0;
+
 	if (!sched_cache_enabled())
 		return;
 
 	rq->nr_llc_running = 0;
+
+	for (i = 0; i < max_llcs; ++i)
+		rq->nr_pref_llc[i] = 0;
+
 	rq->nr_pref_llc_running = 0;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ab64067acc6..b801d32d5fba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1101,6 +1101,7 @@ struct rq {
 #ifdef CONFIG_SCHED_CACHE
 	unsigned int		nr_pref_llc_running;
 	unsigned int		nr_llc_running;
+	unsigned int		nr_pref_llc[NR_LLCS];
 #endif
 #ifdef CONFIG_NO_HZ_COMMON
 	unsigned long		last_blocked_load_update_tick;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 09/19] sched/fair: Count tasks preferring each LLC in a sched group
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (7 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-15 12:22   ` Peter Zijlstra
                     ` (2 more replies)
  2025-10-11 18:24 ` [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing Tim Chen
                   ` (10 subsequent siblings)
  19 siblings, 3 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

During LLC load balancing, tabulate the number of tasks on each runqueue
that prefer a given destination LLC in a sched group.

For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
selected as the busiest source to pick tasks from.

Within a source LLC, the total number of tasks preferring a destination
LLC is computed by summing the per-runqueue counts across all CPUs in
that LLC. For instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1
task preferring LLC3, the total for LLC0 is 3.

These statistics allow the load balancer to choose tasks from source
sched groups that best match their preferred LLCs.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7a68fe7601b..cbd1e97bca4b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10399,6 +10399,9 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int nr_pref_llc[NR_LLCS];
+#endif
 };
 
 /*
@@ -10891,6 +10894,14 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (cpu_overutilized(i))
 			*sg_overutilized = 1;
 
+#ifdef CONFIG_SCHED_CACHE
+		if (sched_cache_enabled()) {
+			int j;
+
+			for (j = 0; j < max_llcs; ++j)
+				sgs->nr_pref_llc[j] += rq->nr_pref_llc[j];
+		}
+#endif
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (8 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 09/19] sched/fair: Count tasks preferring each LLC in a sched group Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-15  7:23   ` kernel test robot
                     ` (4 more replies)
  2025-10-11 18:24 ` [PATCH 11/19] sched/fair: Identify busiest sched_group for LLC-aware load balancing Tim Chen
                   ` (9 subsequent siblings)
  19 siblings, 5 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

During LLC load balancing, first check for tasks that prefer the
destination LLC and balance them to it before others.

Mark source sched groups containing tasks preferring non-local LLCs
with the group_llc_balance flag. This ensures the load balancer later
pulls or pushes these tasks toward their preferred LLCs.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cbd1e97bca4b..af7b578eaa06 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9822,8 +9822,7 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
 	else
 		return mig_unrestricted;
 
-	return can_migrate_llc(src_cpu, dst_cpu,
-			       task_util(p), to_pref);
+	return can_migrate_llc(src_cpu, dst_cpu, task_util(p), to_pref);
 }
 
 #else
@@ -10394,6 +10393,7 @@ struct sg_lb_stats {
 	enum group_type group_type;
 	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
+	unsigned int group_llc_balance;		/* Tasks should be moved to preferred LLC */
 	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -10849,11 +10849,45 @@ static void record_sg_llc_stats(struct lb_env *env,
 	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
 		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
 }
+
+/*
+ * Do LLC balance on sched group that contains LLC, and have tasks preferring
+ * to run on LLC in idle dst_cpu.
+ */
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	struct sched_domain *child = env->sd->child;
+	int llc;
+
+	if (!sched_cache_enabled())
+		return false;
+
+	if (env->sd->flags & SD_SHARE_LLC)
+		return false;
+
+	/* only care about task migration among LLCs */
+	if (child && !(child->flags & SD_SHARE_LLC))
+		return false;
+
+	llc = llc_idx(env->dst_cpu);
+	if (sgs->nr_pref_llc[llc] > 0 &&
+	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
+		return true;
+
+	return false;
+}
 #else
 static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
 				       struct sched_group *group)
 {
 }
+
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	return false;
+}
 #endif
 
 /**
@@ -10954,6 +10988,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	record_sg_llc_stats(env, sgs, group);
+
+	/* Check if tasks in this group can be moved to their preferred LLC */
+	if (!local_group && llc_balance(env, sgs, group))
+		sgs->group_llc_balance = 1;
+
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 11/19] sched/fair: Identify busiest sched_group for LLC-aware load balancing
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (9 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-15 15:24   ` Peter Zijlstra
  2025-10-11 18:24 ` [PATCH 12/19] sched/fair: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

The load balancer selects the busiest sched_group and migrates tasks
to less busy groups to distribute load across CPUs.

With cache-aware scheduling enabled, the busiest sched_group is
the one with the most tasks preferring the destination LLC. If
the group has the group_llc_balance flag set, cache-aware load
balancing is triggered.

Introduce the helper function update_llc_busiest() to identify the
sched_group with the most tasks preferring the destination LLC.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index af7b578eaa06..8469ec528cb1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10877,6 +10877,23 @@ static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 
 	return false;
 }
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	int idx;
+
+	/* Only the candidate with llc_balance needs to be taken care of */
+	if (!sgs->group_llc_balance)
+		return false;
+
+	/*
+	 * There are more tasks that want to run on dst_cpu's LLC.
+	 */
+	idx = llc_idx(env->dst_cpu);
+	return sgs->nr_pref_llc[idx] > busiest->nr_pref_llc[idx];
+}
 #else
 static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
 				       struct sched_group *group)
@@ -10888,6 +10905,13 @@ static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 {
 	return false;
 }
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	return false;
+}
 #endif
 
 /**
@@ -11035,6 +11059,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
+	/* Try preferred-LLC balance first; if that fails, fall back to normal load balance */
+	if (update_llc_busiest(env, busiest, sgs))
+		return true;
+
+	/*
+	 * If the busiest group has tasks with LLC preference,
+	 * skip normal load balance.
+	 */
+	if (busiest->group_llc_balance)
+		return false;
+
 	if (sgs->group_type > busiest->group_type)
 		return true;
 
@@ -11942,9 +11977,11 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
 	/*
 	 * Try to move all excess tasks to a sibling domain of the busiest
 	 * group's child domain.
+	 * Also do so if we can move some tasks that prefer the local LLC.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    sibling_imbalance(env, &sds, busiest, local) > 1)
+	    (busiest->group_llc_balance ||
+	    sibling_imbalance(env, &sds, busiest, local) > 1))
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 12/19] sched/fair: Add migrate_llc_task migration type for cache-aware balancing
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (10 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 11/19] sched/fair: Identify busiest sched_group for LLC-aware load balancing Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-27  9:04   ` K Prateek Nayak
  2025-10-11 18:24 ` [PATCH 13/19] sched/fair: Handle moving single tasks to/from their preferred LLC Tim Chen
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

Introduce a new migration type, migrate_llc_task, to support
cache-aware load balancing.

After identifying the busiest sched_group (having the most tasks
preferring the destination LLC), mark migrations with this type.
During load balancing, each runqueue in the busiest sched_group is
examined, and the runqueue with the highest number of tasks preferring
the destination CPU's LLC is selected as the busiest runqueue.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8469ec528cb1..bec6354d7841 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9504,7 +9504,8 @@ enum migration_type {
 	migrate_load = 0,
 	migrate_util,
 	migrate_task,
-	migrate_misfit
+	migrate_misfit,
+	migrate_llc_task
 };
 
 #define LBF_ALL_PINNED	0x01
@@ -10082,6 +10083,10 @@ static int detach_tasks(struct lb_env *env)
 			env->imbalance -= util;
 			break;
 
+		case migrate_llc_task:
+			env->imbalance--;
+			break;
+
 		case migrate_task:
 			env->imbalance--;
 			break;
@@ -11733,6 +11738,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}
 
+#ifdef CONFIG_SCHED_CACHE
+	if (busiest->group_llc_balance) {
+		/* Move a task that prefer local LLC */
+		env->migration_type = migrate_llc_task;
+		env->imbalance = 1;
+		return;
+	}
+#endif
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -12041,6 +12055,10 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 	struct rq *busiest = NULL, *rq;
 	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
 	unsigned int busiest_nr = 0;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int busiest_pref_llc = 0;
+	int dst_llc;
+#endif
 	int i;
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
@@ -12149,6 +12167,16 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 			}
 			break;
 
+		case migrate_llc_task:
+#ifdef CONFIG_SCHED_CACHE
+			dst_llc = llc_idx(env->dst_cpu);
+			if (!cpus_share_cache(env->dst_cpu, rq->cpu) &&
+			    busiest_pref_llc < rq->nr_pref_llc[dst_llc]) {
+				busiest_pref_llc = rq->nr_pref_llc[dst_llc];
+				busiest = rq;
+			}
+#endif
+			break;
 		case migrate_task:
 			if (busiest_nr < nr_running) {
 				busiest_nr = nr_running;
@@ -12331,6 +12359,8 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
 	case migrate_misfit:
 		__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
 		break;
+	case migrate_llc_task:
+		break;
 	}
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 13/19] sched/fair: Handle moving single tasks to/from their preferred LLC
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (11 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 12/19] sched/fair: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-11 18:24 ` [PATCH 14/19] sched/fair: Consider LLC preference when selecting tasks for load balancing Tim Chen
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

If the busiest runqueue has only one task, active balancing may be
invoked to move it. However, before migration, check whether the task
is running on its preferred LLC.

Do not move a lone task to another LLC if it would move the task
away from its preferred LLC or cause excessive imbalance between LLCs.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 62 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 59 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bec6354d7841..19ba9c1b9a63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9826,12 +9826,53 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
 	return can_migrate_llc(src_cpu, dst_cpu, task_util(p), to_pref);
 }
 
+static inline bool
+break_llc_locality(struct lb_env *env)
+{
+	if (!sched_cache_enabled())
+		return false;
+
+	if (cpus_share_cache(env->src_cpu, env->dst_cpu))
+		return false;
+	/*
+	 * All tasks on this runqueue prefer to stay in their current LLC.
+	 * Do not pull a task from its preferred LLC if:
+	 * 1. It is the only task running there; OR
+	 * 2. Migrating it away from its preferred LLC would violate
+	 *    the cache-aware scheduling policy.
+	 */
+	if (env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
+		unsigned long util = 0;
+		struct task_struct *cur;
+
+		if (env->src_rq->nr_running <= 1)
+			return true;
+
+		rcu_read_lock();
+		cur = rcu_dereference(env->src_rq->curr);
+		if (cur)
+			util = task_util(cur);
+		rcu_read_unlock();
+
+		if (can_migrate_llc(env->src_cpu, env->dst_cpu,
+				    util, false) == mig_forbid)
+			return true;
+	}
+
+	return false;
+}
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
 				 unsigned long *cap)
 {
 	return false;
 }
+
+static inline bool
+break_llc_locality(struct lb_env *env)
+{
+	return false;
+}
 #endif
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -12247,6 +12288,9 @@ static int need_active_balance(struct lb_env *env)
 {
 	struct sched_domain *sd = env->sd;
 
+	if (break_llc_locality(env))
+		return 0;
+
 	if (asym_active_balance(env))
 		return 1;
 
@@ -12266,7 +12310,8 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
-	if (env->migration_type == migrate_misfit)
+	if (env->migration_type == migrate_misfit ||
+	    env->migration_type == migrate_llc_task)
 		return 1;
 
 	return 0;
@@ -12711,9 +12756,20 @@ static int active_load_balance_cpu_stop(void *data)
 		goto out_unlock;
 
 	/* Is there any task to move? */
-	if (busiest_rq->nr_running <= 1)
-		goto out_unlock;
+	if (busiest_rq->nr_running <= 1) {
+#ifdef CONFIG_SCHED_CACHE
+		int llc = llc_idx(target_cpu);
 
+		if (!sched_cache_enabled())
+			goto out_unlock;
+
+		if (llc < 0)
+			goto out_unlock;
+		/* don't migrate if no task prefers target */
+		if (busiest_rq->nr_pref_llc[llc] < 1)
+#endif
+			goto out_unlock;
+	}
 	/*
 	 * This condition is "impossible", if it occurs
 	 * we need to fix it. Originally reported by
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 14/19] sched/fair: Consider LLC preference when selecting tasks for load balancing
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (12 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 13/19] sched/fair: Handle moving single tasks to/from their preferred LLC Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-11 18:24 ` [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach Tim Chen
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

Currently, task selection from the busiest runqueue ignores LLC
preferences. Reorder tasks in the busiest queue to prioritize selection
as follows:

  1. Tasks preferring the destination CPU's LLC
  2. Tasks with no LLC preference
  3. Tasks preferring an LLC different from their current one
  4. Tasks preferring the LLC they are currently on

This improves the likelihood that tasks are migrated to their
preferred LLC.
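
For illustration (task names made up): when balancing from LLC0 to LLC1
with queued tasks A (prefers LLC1), B (no preference), C (prefers LLC2)
and D (prefers LLC0), the reordered list reads D, C, B, A from head to
tail, so detach_tasks(), which pops from the tail, tries A first and
touches D only as a last resort.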

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19ba9c1b9a63..0fafbfedb21d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10036,6 +10036,68 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 	return NULL;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Prepare lists to detach tasks in the following order:
+ * 1. tasks that prefer dst cpu's LLC
+ * 2. tasks that have no preference in LLC
+ * 3. tasks that prefer LLC other than the ones they are on
+ * 4. tasks that prefer the LLC that they are currently on.
+ */
+static struct list_head
+*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks)
+{
+	struct task_struct *p;
+	LIST_HEAD(pref_old_llc);
+	LIST_HEAD(pref_new_llc);
+	LIST_HEAD(no_pref_llc);
+	LIST_HEAD(pref_other_llc);
+
+	if (!sched_cache_enabled())
+		return tasks;
+
+	if (cpus_share_cache(env->dst_cpu, env->src_cpu))
+		return tasks;
+
+	while (!list_empty(tasks)) {
+		p = list_last_entry(tasks, struct task_struct, se.group_node);
+
+		if (p->preferred_llc == llc_id(env->dst_cpu)) {
+			list_move(&p->se.group_node, &pref_new_llc);
+			continue;
+		}
+
+		if (p->preferred_llc == llc_id(env->src_cpu)) {
+			list_move(&p->se.group_node, &pref_old_llc);
+			continue;
+		}
+
+		if (p->preferred_llc == -1) {
+			list_move(&p->se.group_node, &no_pref_llc);
+			continue;
+		}
+
+		list_move(&p->se.group_node, &pref_other_llc);
+	}
+
+	/*
+	 * We detach tasks from list tail in detach tasks.  Put tasks
+	 * to be chosen first at end of list.
+	 */
+	list_splice(&pref_new_llc, tasks);
+	list_splice(&no_pref_llc, tasks);
+	list_splice(&pref_other_llc, tasks);
+	list_splice(&pref_old_llc, tasks);
+	return tasks;
+}
+#else
+static inline struct list_head
+*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks)
+{
+	return tasks;
+}
+#endif
+
 /*
  * detach_tasks() -- tries to detach up to imbalance load/util/tasks from
  * busiest_rq, as part of a balancing operation within domain "sd".
@@ -10044,7 +10106,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
  */
 static int detach_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->src_rq->cfs_tasks;
+	struct list_head *tasks;
 	unsigned long util, load;
 	struct task_struct *p;
 	int detached = 0;
@@ -10063,6 +10125,8 @@ static int detach_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
+	tasks = order_tasks_by_llc(env, &env->src_rq->cfs_tasks);
+
 	while (!list_empty(tasks)) {
 		/*
 		 * We don't want to steal all, otherwise we may be treated likewise,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (13 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 14/19] sched/fair: Consider LLC preference when selecting tasks for load balancing Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-28  6:02   ` K Prateek Nayak
  2025-10-11 18:24 ` [PATCH 16/19] sched/fair: Exclude processes with many threads from cache-aware scheduling Tim Chen
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

During the final step of load balancing, can_migrate_task() now
considers a task's LLC preference before moving it out of its
preferred LLC.

Additionally, add checks in detach_tasks() to prevent selecting tasks
that prefer their current LLC.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fafbfedb21d..65ff7c306a2f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9801,8 +9801,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
  * Check if task p can migrate from src_cpu to dst_cpu
  * in terms of cache aware load balance.
  */
-static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
-							struct task_struct *p)
+static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+					 struct task_struct *p)
 {
 	struct mm_struct *mm;
 	bool to_pref;
@@ -9969,6 +9969,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (env->flags & LBF_ACTIVE_LB)
 		return 1;
 
+#ifdef CONFIG_SCHED_CACHE
+	if (sched_cache_enabled() &&
+	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid)
+		return 0;
+#endif
+
 	degrades = migrate_degrades_locality(p, env);
 	if (!degrades)
 		hot = task_hot(p, env);
@@ -10227,6 +10233,20 @@ static int detach_tasks(struct lb_env *env)
 		if (env->imbalance <= 0)
 			break;
 
+#ifdef CONFIG_SCHED_CACHE
+		/*
+		 * Don't detach more tasks if the remaining tasks want
+		 * to stay. We know the remaining tasks all prefer the
+		 * current LLC, because after order_tasks_by_llc(), the
+		 * tasks that prefer the current LLC are at the tail of
+		 * the list. The inhibition of detachment is to avoid too
+		 * many tasks being migrated out of the preferred LLC.
+		 */
+		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
+		    llc_id(env->src_cpu) == p->preferred_llc)
+			break;
+#endif
+
 		continue;
 next:
 		if (p->sched_task_hot)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 16/19] sched/fair: Exclude processes with many threads from cache-aware scheduling
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (14 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-23  7:22   ` kernel test robot
  2025-10-11 18:24 ` [PATCH 17/19] sched/fair: Disable cache aware scheduling for processes with high thread counts Tim Chen
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen,
	Aubrey Li, Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread count.

If the count exceeds the number of CPUs in the process's preferred LLC,
sched_cache will avoid aggregating too many threads into a single LLC
domain.
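
For context, update_avg() is the scheduler's existing helper for a
simple exponentially weighted average, roughly as sketched below (from
memory, not a quote of the tree). mm->nr_running_avg therefore tracks
the active thread count smoothed over several scan epochs rather than
an instantaneous sample:

static inline void update_avg(u64 *avg, u64 sample)
{
	s64 diff = sample - *avg;

	*avg += diff / 8;	/* each new sample gets 1/8 weight */
}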

Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 14 ++++++++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3ca557c2f36d..b307f81b2fde 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1031,6 +1031,7 @@ struct mm_struct {
 		raw_spinlock_t mm_sched_lock;
 		unsigned long mm_sched_epoch;
 		int mm_sched_cpu;
+		u64 nr_running_avg ____cacheline_aligned_in_smp;
 #endif
 
 #ifdef CONFIG_MMU
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 65ff7c306a2f..79d109f8a09f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1451,12 +1451,13 @@ static void get_scan_cpumasks(cpumask_var_t cpus, int cache_cpu,
 
 static void __no_profile task_cache_work(struct callback_head *work)
 {
-	struct task_struct *p = current;
+	struct task_struct *p = current, *cur;
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
 	unsigned long curr_m_a_occ = 0;
 	int cpu, m_a_cpu = -1, cache_cpu,
-	    pref_nid = NUMA_NO_NODE, curr_cpu;
+	    pref_nid = NUMA_NO_NODE, curr_cpu,
+	    nr_running = 0;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1497,6 +1498,14 @@ static void __no_profile task_cache_work(struct callback_head *work)
 					m_occ = occ;
 					m_cpu = i;
 				}
+
+				rcu_read_lock();
+				cur = rcu_dereference(cpu_rq(i)->curr);
+				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+				    cur->mm == mm)
+					nr_running++;
+				rcu_read_unlock();
+
 			}
 
 			/*
@@ -1540,6 +1549,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
 		mm->mm_sched_cpu = m_a_cpu;
 	}
 
+	update_avg(&mm->nr_running_avg, nr_running);
 	free_cpumask_var(cpus);
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 17/19] sched/fair: Disable cache aware scheduling for processes with high thread counts
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (15 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 16/19] sched/fair: Exclude processes with many threads from cache-aware scheduling Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-22 17:21   ` Madadi Vineeth Reddy
  2025-10-11 18:24 ` [PATCH 18/19] sched/fair: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen,
	Aubrey Li, Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

If the number of active threads within the process exceeds the number
of cores in the LLC (i.e., the LLC's CPU count divided by the number of
SMT siblings per core), do not enable cache-aware scheduling.
Aggregating that many threads risks cache contention within the
preferred LLC.
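
A worked example of the threshold (numbers illustrative): an LLC
spanning 16 CPUs with 2 SMT siblings per core has 8 cores, and
aggregation is skipped once nr_running_avg * smt_nr exceeds the LLC's
CPU count:

/* exceed_llc_nr(): nr_running_avg * smt_nr > sd_llc_size ?  */
/*   nr_running_avg = 8:  8 * 2 = 16 > 16  -> no,  aggregate */
/*   nr_running_avg = 9:  9 * 2 = 18 > 16  -> yes, skip      */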

Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 79d109f8a09f..6b8eace79eee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1240,6 +1240,18 @@ static inline int pref_llc_idx(struct task_struct *p)
 	return llc_idx(p->preferred_llc);
 }
 
+static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+{
+	int smt_nr = 1;
+
+#ifdef CONFIG_SCHED_SMT
+	if (sched_smt_active())
+		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
+#endif
+
+	return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
 	int pref_llc;
@@ -1385,10 +1397,12 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 
 	/*
 	 * If this task hasn't hit task_cache_work() for a while, or it
-	 * has only 1 thread, invalidate its preferred state.
+	 * has only 1 thread, or has too many active threads, invalidate
+	 * its preferred state.
 	 */
 	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
-	    get_nr_threads(p) <= 1) {
+	    get_nr_threads(p) <= 1 ||
+	    exceed_llc_nr(mm, cpu_of(rq))) {
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 	}
@@ -1467,6 +1481,11 @@ static void __no_profile task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	if (get_nr_threads(p) <= 1) {
+		mm->mm_sched_cpu = -1;
+		return;
+	}
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
 
@@ -9826,6 +9845,10 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
+	 /* skip cache aware load balance for single/too many threads */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
+		return mig_unrestricted;
+
 	if (cpus_share_cache(dst_cpu, cpu))
 		to_pref = true;
 	else if (cpus_share_cache(src_cpu, cpu))
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 18/19] sched/fair: Avoid cache-aware scheduling for memory-heavy processes
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (16 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 17/19] sched/fair: Disable cache aware scheduling for processes with high thread counts Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-15  6:57   ` kernel test robot
  2025-10-11 18:24 ` [PATCH 19/19] sched/fair: Add user control to adjust the tolerance of cache-aware scheduling Tim Chen
  2025-10-14 12:13 ` [PATCH 00/19] Cache Aware Scheduling Madadi Vineeth Reddy
  19 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen,
	Aubrey Li, Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint by comparing
its RSS (anonymous and shared pages) to the size of the LLC. If RSS
exceeds the LLC size, skip cache-aware scheduling.

Note that RSS is only an approximation of the memory footprint.
By default, the comparison is strict, but a later patch will allow
users to provide a hint to adjust this threshold.

According to testing from Adam, some systems do not have a shared L3
but have shared L2 clusters; in this case, the L2 becomes the LLC[1].

Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com/
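
The resulting check is a plain byte comparison against the LLC (or
fallback L2) size. For example, with a 32MB LLC and 4KB pages
(illustrative numbers), 32MB / 4KB = 8192 pages, so a process whose
anonymous plus shmem RSS reaches 8192 pages is treated as exceeding
the LLC capacity and is skipped.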

Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/cacheinfo.h | 21 ++++++++++------
 kernel/sched/fair.c       | 51 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index c8f4f0a0b874..82d0d59ca0e1 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
 
 const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
 
-/*
- * Get the cacheinfo structure for the cache associated with @cpu at
- * level @level.
- * cpuhp lock must be held.
- */
-static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
 {
 	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
 	int i;
 
-	lockdep_assert_cpus_held();
-
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == level) {
 			if (ci->info_list[i].attributes & CACHE_ID)
@@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
 	return NULL;
 }
 
+/*
+ * Get the cacheinfo structure for the cache associated with @cpu at
+ * level @level.
+ * cpuhp lock must be held.
+ */
+static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+{
+	lockdep_assert_cpus_held();
+
+	return _get_cpu_cacheinfo_level(cpu, level);
+}
+
 /*
  * Get the id of the cache associated with @cpu at level @level.
  * cpuhp lock must be held.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b8eace79eee..46dfcd2a01b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1240,6 +1240,38 @@ static inline int pref_llc_idx(struct task_struct *p)
 	return llc_idx(p->preferred_llc);
 }
 
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+	struct cacheinfo *ci;
+	unsigned long rss;
+	unsigned int llc;
+
+	/*
+	 * get_cpu_cacheinfo_level() can not be used
+	 * because it requires the cpu_hotplug_lock
+	 * to be held. Use _get_cpu_cacheinfo_level()
+	 * directly because the 'cpu' can not be
+	 * offlined at the moment.
+	 */
+	ci = _get_cpu_cacheinfo_level(cpu, 3);
+	if (!ci) {
+		/*
+		 * On system without L3 but with shared L2,
+		 * L2 becomes the LLC.
+		 */
+		ci = _get_cpu_cacheinfo_level(cpu, 2);
+		if (!ci)
+			return true;
+	}
+
+	llc = ci->size;
+
+	rss = get_mm_counter(mm, MM_ANONPAGES) +
+		get_mm_counter(mm, MM_SHMEMPAGES);
+
+	return (llc <= (rss * PAGE_SIZE));
+}
+
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
 	int smt_nr = 1;
@@ -1402,7 +1434,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 */
 	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
 	    get_nr_threads(p) <= 1 ||
-	    exceed_llc_nr(mm, cpu_of(rq))) {
+	    exceed_llc_nr(mm, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 	}
@@ -1486,6 +1519,14 @@ static void __no_profile task_cache_work(struct callback_head *work)
 		return;
 	}
 
+	/*
+	 * Do not check exceed_llc_nr() because
+	 * the active number of threads needs to
+	 * be updated anyway.
+	 */
+	if (exceed_llc_capacity(mm, curr_cpu))
+		return;
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
 
@@ -9845,8 +9886,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
-	 /* skip cache aware load balance for single/too many threads */
-	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
+	/*
+	 * skip cache aware load balance for single/too many threads
+	 * or large footprint.
+	 */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu))
 		return mig_unrestricted;
 
 	if (cpus_share_cache(dst_cpu, cpu))
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 116+ messages in thread

* [PATCH 19/19] sched/fair: Add user control to adjust the tolerance of cache-aware scheduling
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (17 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 18/19] sched/fair: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
@ 2025-10-11 18:24 ` Tim Chen
  2025-10-29  8:07   ` Aaron Lu
  2025-10-14 12:13 ` [PATCH 00/19] Cache Aware Scheduling Madadi Vineeth Reddy
  19 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-11 18:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Tim Chen,
	Aubrey Li, Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

With sched_cache enabled, the scheduler uses a process's RSS as a
proxy for its LLC footprint to determine if aggregating tasks on the
preferred LLC could cause cache contention. If RSS exceeds the LLC
size, aggregation is skipped. Some workloads with large RSS but small
actual memory footprints may still benefit from aggregation. Since
the kernel cannot efficiently track per-task cache usage (resctrl is
user-space only), userspace can provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
users control how strictly RSS limits aggregation. Values range from
0 to 100:

  - 0: Cache-aware scheduling is disabled.
  - 1: Strict; tasks with RSS larger than LLC size are skipped.
  - 100: Aggressive; tasks are aggregated regardless of RSS.

For example, with a 32MB L3 cache:

  - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
  - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
    (784GB = (1 + (99 - 1) * 256) * 32MB).

Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
how strictly the number of active threads is considered when doing
cache-aware load balancing. The number of SMT siblings per core is
also taken into account: high SMT counts reduce the aggregation
capacity, preventing excessive task aggregation on SMT-heavy systems
such as Power10/Power11.

For example, with 8 cores/16 CPUs in an L3:

  - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
  - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
    (785 = 1 + (99 - 1) * 8).
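
For reference, the thresholds above can be reproduced with the small
stand-alone calculation below. This is only an illustrative sketch: the
constants (32MB L3, 8 cores per LLC) are the example values above, and
the names are made up for the example rather than kernel symbols.

#include <stdio.h>

/* Assumed example topology: 32MB L3, 8 cores per LLC. */
#define LLC_SIZE_MB	32
#define CORES_PER_LLC	8

/* 0 disables aggregation, 100 is treated as "no limit". */
static long scale(unsigned int tolerance, int mul)
{
	if (tolerance == 0)
		return 0;
	if (tolerance == 100)
		return -1;	/* unlimited */
	return 1 + (long)(tolerance - 1) * mul;
}

int main(void)
{
	unsigned int tol = 99;

	/* RSS threshold: scale factor times the LLC size */
	printf("RSS limit  : %ld MB\n", scale(tol, 256) * LLC_SIZE_MB);
	/* thread threshold: scale factor applied to cores per LLC */
	printf("nr_running : %ld\n", scale(tol, CORES_PER_LLC));
	return 0;
}

With tol = 99 this prints an RSS limit of 802848 MB (about 784GB) and
an nr_running limit of 785, matching the numbers above.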

Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reported-by: Tingyin Duan <tingyin.duan@gmail.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/debug.c | 56 ++++++++++++++++++++++++++++++--
 kernel/sched/fair.c  | 76 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  3 ++
 3 files changed, 126 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 57bb04ebbf96..cfcd8b436cc5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -169,6 +169,50 @@ static const struct file_operations sched_feat_fops = {
 	.release	= single_release,
 };
 
+#ifdef CONFIG_SCHED_CACHE
+#define SCHED_CACHE_CREATE_CONTROL(name)			  \
+static ssize_t sched_cache_write_##name(struct file *filp,	  \
+					const char __user *ubuf,  \
+					size_t cnt, loff_t *ppos) \
+{								  \
+	char buf[16];						  \
+	unsigned int percent;					  \
+	if (cnt > 15)						  \
+		cnt = 15;					  \
+	if (copy_from_user(&buf, ubuf, cnt))			  \
+		return -EFAULT;					  \
+	buf[cnt] = '\0';					  \
+	if (kstrtouint(buf, 10, &percent))			  \
+		return -EINVAL;					  \
+	if (percent > 100)					  \
+		return -EINVAL;					  \
+	llc_##name = percent;					  \
+	*ppos += cnt;						  \
+	return cnt;						  \
+}								  \
+static int sched_cache_show_##name(struct seq_file *m, void *v)	  \
+{								  \
+	seq_printf(m, "%d\n", llc_##name);			  \
+	return 0;						  \
+}								  \
+static int sched_cache_open_##name(struct inode *inode,		  \
+				   struct file *filp)		  \
+{								  \
+	return single_open(filp, sched_cache_show_##name, NULL);  \
+}								  \
+static const struct file_operations sched_cache_fops_##name = {	  \
+	.open		= sched_cache_open_##name,		  \
+	.write		= sched_cache_write_##name,		  \
+	.read		= seq_read,				  \
+	.llseek		= seq_lseek,				  \
+	.release	= single_release,			  \
+}
+
+SCHED_CACHE_CREATE_CONTROL(overload_pct);
+SCHED_CACHE_CREATE_CONTROL(imb_pct);
+SCHED_CACHE_CREATE_CONTROL(aggr_tolerance);
+#endif /* SCHED_CACHE */
+
 static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
 				   size_t cnt, loff_t *ppos)
 {
@@ -524,8 +568,16 @@ static __init int sched_init_debug(void)
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_SCHED_CACHE
-	debugfs_create_u32("llc_overload_pct", 0644, debugfs_sched, &llc_overload_pct);
-	debugfs_create_u32("llc_imb_pct", 0644, debugfs_sched, &llc_imb_pct);
+	debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_overload_pct);
+	debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_imb_pct);
+	debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_aggr_tolerance);
+	debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
+			   &llc_epoch_period);
+	debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
+			   &llc_epoch_affinity_timeout);
 #endif
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46dfcd2a01b3..f9084e2f9ef2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1207,9 +1207,62 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 
 __read_mostly unsigned int llc_overload_pct       = 50;
 __read_mostly unsigned int llc_imb_pct            = 20;
+__read_mostly unsigned int llc_aggr_tolerance     = 1;
+__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
 
 DEFINE_STATIC_KEY_FALSE(sched_cache_allowed);
 
+static inline int get_sched_cache_scale(int mul)
+{
+	if (!llc_aggr_tolerance)
+		return 0;
+
+	if (llc_aggr_tolerance == 100)
+		return INT_MAX;
+
+	return (1 + (llc_aggr_tolerance - 1) * mul);
+}
+
+static inline int get_sched_cache_rss_scale(void)
+{
+	/*
+	 * Suppose the L3 size is 32MB. If the
+	 * llc_aggr_tolerance is 1:
+	 * When the RSS is larger than 32MB,
+	 * the process is regarded as exceeding
+	 * the LLC capacity. If the
+	 * llc_aggr_tolerance is 99:
+	 * When the RSS is larger than 784GB,
+	 * the process is regarded as exceeding
+	 * the LLC capacity:
+	 * 784GB = (1 + (99 - 1) * 256) * 32MB
+	 */
+	return get_sched_cache_scale(256);
+}
+
+static inline int get_sched_cache_nr_scale(void)
+{
+	/*
+	 * Suppose the number of Cores in LLC is 8.
+	 * Every core has 2 SMTs.
+	 * If the llc_aggr_tolerance is 1: When the
+	 * nr_running is larger than 8, the process
+	 * is regarded as exceeding the LLC capacity.
+	 * If the llc_aggr_tolerance is 99:
+	 * When the nr_running is larger than 785,
+	 * the process is regarded as exceeding
+	 * the LLC capacity:
+	 * 785 = 1 + (99 - 1) * 8
+	 */
+	return get_sched_cache_scale(1);
+}
+
+static inline int get_sched_cache_cap_scale(void)
+{
+	return (llc_overload_pct / cpu_smt_num_threads);
+}
+
 static inline bool sched_cache_enabled(void)
 {
 	return sched_feat(SCHED_CACHE) &&
@@ -1245,6 +1298,7 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	struct cacheinfo *ci;
 	unsigned long rss;
 	unsigned int llc;
+	int scale;
 
 	/*
 	 * get_cpu_cacheinfo_level() can not be used
@@ -1269,19 +1323,27 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	rss = get_mm_counter(mm, MM_ANONPAGES) +
 		get_mm_counter(mm, MM_SHMEMPAGES);
 
-	return (llc <= (rss * PAGE_SIZE));
+	scale = get_sched_cache_rss_scale();
+	if (scale == INT_MAX)
+		return false;
+
+	return ((llc * scale) <= (rss * PAGE_SIZE));
 }
 
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
-	int smt_nr = 1;
+	int smt_nr = 1, scale;
 
 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active())
 		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
 #endif
 
-	return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
+	scale = get_sched_cache_nr_scale();
+	if (scale == INT_MAX)
+		return false;
+
+	return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
 }
 
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1370,9 +1432,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
 	long delta = now - rq->cpu_epoch_next;
 
 	if (delta > 0) {
-		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		n = (delta + llc_epoch_period - 1) / llc_epoch_period;
 		rq->cpu_epoch += n;
-		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		rq->cpu_epoch_next += n * llc_epoch_period;
 		__shr_u64(&rq->cpu_runtime, n);
 	}
 
@@ -1432,7 +1494,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * has only 1 thread, or has too many active threads, invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
 	    get_nr_threads(p) <= 1 ||
 	    exceed_llc_nr(mm, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
@@ -9749,7 +9811,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
  * (default: ~50%)
  */
 #define fits_llc_capacity(util, max)	\
-	((util) * 100 < (max) * llc_overload_pct)
+	((util) * 100 < (max) * get_sched_cache_cap_scale())
 
 /*
  * The margin used when comparing utilization.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b801d32d5fba..97e8558b0530 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2810,6 +2810,9 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
 #ifdef CONFIG_SCHED_CACHE
 extern unsigned int llc_overload_pct;
 extern unsigned int llc_imb_pct;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
 extern struct static_key_false sched_cache_allowed;
 #endif
 
-- 
2.32.0



* Re: [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes
  2025-10-11 18:24 ` [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes Tim Chen
@ 2025-10-14  5:16   ` Chen, Yu C
  2025-10-15 11:15     ` Peter Zijlstra
  0 siblings, 1 reply; 116+ messages in thread
From: Chen, Yu C @ 2025-10-14  5:16 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vern Hao
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Libo Chen,
	Adam Li, Tim Chen, linux-kernel

(Copied the question from Vern, as his email does not seem to have reached LKML)

On 10/14/2025 2:09 AM, Tim Chen wrote:
 > On Mon, 2025-10-13 at 17:10 +0800, vernhao wrote:
 >>
 >> Tim Chen<tim.c.chen@linux.intel.com> wrote:
 >> With cache-aware scheduling enabled, each task is assigned a
 >> preferred LLC ID. This allows quick identification of the LLC domain
 >> where the task prefers to run, similar to numa_preferred_nid in
 >> NUMA balancing.
 >>
 >> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

[snip]

 >> +
 >> + if (mm->mm_sched_cpu != -1)
 >> + mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);
 >>
 >> In high-concurrency multi-threaded scenarios, not all threads handle
 >> same events, so their hot data in the LLC is not completely shared.
 >> Therefore, if every thread's preferred LLC is migrated to the LLC
 >> pointed to by mm->mm_sched_cpu, this would lead to the incorrect
 >> assumption that all threads prefer the same LLC, thereby intensifying
 >> competition between LLCs.
 >
 > Yes, that's the reason why we stop aggregating to the preferred LLC
 > once the the utilization of the
 > LLC becomes too high relative to the other LLCs.
 >
 > If you know your threads characteristics before hand on which of them
 > share data together, you probably can use cgroup/cpuset
 > from user space to separate out the threads.
 >
 > There's not enough info from occupancy data for OS to group
 > the threads by data sharing. Perhaps an alternative if NUMA balancing
 > is on is to group tasks by their task numa group instead of by mm.
 >
 > That would incur the page scanning overhead etc and make
 > cache aware scheduling be dependent on NUMA balancing.
 >
 >
 >>
 >> So I'm wondering, why not move ‘mm->mm_sched_cpu’ to ‘task_struct’,
 >> so that each thread can individually track its preferred LLC?
 >> What are the losses in doing so?
 >
 > You would need a way to group related tasks together and put them
 > on the same LLC.  Either group them by mm or some other means.
 >

While Vern's use case is common in production environments, switching
to a per-task_struct preferred_llc might not aggregate the threads onto
dedicated LLCs. It is possible that each thread would stick to its
old LLC, because the thread was forked there and the occupancy is
high on that old LLC. As a result, threads would be randomly "pinned"
to different LLCs.

The question becomes: how can we figure out which threads share
data? Can the kernel detect this, or get a hint from user space?

Yes, the numa_group in NUMA balancing indicates that several tasks
touch the same pages, which could be one indicator. Also, if task A
frequently wakes up task B, does that mean A and B potentially share
data? Furthermore, if task A wakes up B via a pipe, it might also
indicate that A has something to share with B. I wonder if we can
introduce a structure to gather this information together.
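
For instance, something as simple as the hypothetical entry below could
make the idea concrete (not proposed code, and none of these fields
exist in the current series):

	/*
	 * Hypothetical entry in a per-mm table of wakeup relations
	 * between threads, which could serve as a data-sharing hint.
	 */
	struct mm_share_hint {
		pid_t		waker;		/* thread doing the wakeup */
		pid_t		wakee;		/* thread being woken up   */
		unsigned int	wakeup_cnt;	/* how often it happened   */
		unsigned int	via_pipe;	/* wakeups through a pipe  */
	};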

thanks,
Chenyu


* Re: [PATCH 00/19] Cache Aware Scheduling
  2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
                   ` (18 preceding siblings ...)
  2025-10-11 18:24 ` [PATCH 19/19] sched/fair: Add user control to adjust the tolerance of cache-aware scheduling Tim Chen
@ 2025-10-14 12:13 ` Madadi Vineeth Reddy
  2025-10-14 21:48   ` Tim Chen
  19 siblings, 1 reply; 116+ messages in thread
From: Madadi Vineeth Reddy @ 2025-10-14 12:13 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu, Libo Chen,
	Adam Li, Tim Chen, linux-kernel, Madadi Vineeth Reddy

Hi Tim,
Thanks for the patch.

On 11/10/25 23:54, Tim Chen wrote:
> There had been 4 RFC postings of this patch set. We've incorporated
> the feedbacks and comments and now would like to post this patch set
> for consideration of inclusion to mainline. The patches are based on
> the original patch proposed by Peter[1].
> 

[snip]

> The following tunables control under /sys/kernel/debug/sched/ control
> the behavior of cache aware scheduling:
> 
> 1. llc_aggr_tolerance Controls how aggressive we aggregate tasks to
> their preferred LLC, based on a process's RSS size and number of running
> threads.  Processes that have smaller memory footprint and fewer number
> of tasks will benefit better from aggregation.  Varies between 0 to 100
>         0:  Cache aware scheduling is disabled 1:  Process with RSS
>         greater than LLC size,
> 	    or running threads more than number of cpu cores/LLC skip
> 	    aggregation
> 	100:  Aggressive; a process's threads are aggregated regardless of
> 	      RSS or running threads.
> For example, with a 32MB L3 cache 8 cores in L3:
>     llc_aggr_tolerance=1 -> process with RSS > 32MB, or nr_running_avg >
>     8 are skipped.  llc_aggr_tolerance=99 -> process with RSS > 784GB
>     or nr_running_avg > 785 are skipped.  784GB = (1 + (99 - 1) * 256)
>     * 32MB.
>      785  = (1 + (99 - 1) * 8).
> 
> Currently this knob is a global control. Considering that different workloads have
> different requirements for task consolidation, it would be ideal to introduce
> per process control for this knob via prctl in the future.
>  
> 2. llc_overload_pct, llc_imb_pct
> We'll always try to move a task to its preferred LLC if an LLC's average core
> utilization is below llc_overload_pct (default to 50%). Otherwise, the utilization
> of preferred LLC has to be not more than llc_imb_pct (default to 20%) to move a task
> to it. This is to prevent overloading on the preferred LLC.
>  
> 3. llc_epoch_period
> Controls how often the scheduler collect LLC occupancy of a process (default to 10 msec)
>  
> 4. llc_epoch_affinity_timeout
> Detect that if a process has not run for llc_epoch_affinity_timeout (default to 50 msec),
> it loses its cache preference.

How are these default values arrived at? Is it based on some theory or
based on the results of the runs?

>
> Test results:
> The first test platform is a 2 socket Intel Sapphire Rapids with 30
> cores per socket. The DRAM interleaving is enabled in the BIOS so it
> essential has one NUMA node with two last level caches. There are 60
> CPUs associated with each last level cache.
> 
> The second test platform is a AMD Milan. There are 2 Nodes and 64 CPUs
> per node. Each node has 8 CCXs and each CCX has 8 CPUs.
> 
> The third test platform is a AMD Genoa. There are 4 Nodes and 32 CPUs per node.
> Each node has 2 CCXs and each CCX has 16 CPUs.
> 
> [TL;DR]
> Sappire Rapids:
> hackbench shows significant improvement when there is 1 group
> with different number of fd pairs(threads) within this process.
> schbench shows overall wakeup latency improvement.
> ChaCha20-xiangshan shows ~10% throughput improvement. Other
> micro-workloads did not show much difference.
> 
> Milan:
> No obvious difference is observed so far.
> 
> Genoa:
> ChaCha20-xiangshan shows 44% throughput improvement.
> 
> [Sapphire Rapids details]
> 
> [hackbench]
> Hackbench show overall improvement when there is only 1
> group, with different number of fd(pairs). This is the
> expected behavior because this test scenario would benefit
> from cache aware load balance most. Other number of groups
> shows not much difference(using default fd = 20).
> 
>        groups              baseline            sched_cache
> Min       1      37.5960 (   0.00%)     26.4340 (  29.69%)
> Min       3      38.7050 (   0.00%)     38.6920 (   0.03%)
> Min       5      39.4550 (   0.00%)     38.6280 (   2.10%)
> Min       7      51.4270 (   0.00%)     50.6790 (   1.45%)
> Min       12     62.8540 (   0.00%)     63.6590 (  -1.28%)
> Min       16     74.0160 (   0.00%)     74.7480 (  -0.99%)
> Amean     1      38.4768 (   0.00%)     26.7146 *  30.57%*
> Amean     3      39.0750 (   0.00%)     39.5586 (  -1.24%)
> Amean     5      41.5178 (   0.00%)     41.2766 (   0.58%)
> Amean     7      52.1164 (   0.00%)     51.5152 (   1.15%)
> Amean     12     63.9052 (   0.00%)     64.0420 (  -0.21%)
> Amean     16     74.5812 (   0.00%)     75.4318 (  -1.14%)
> BAmean-99 1      38.2027 (   0.00%)     26.5500 (  30.50%)
> BAmean-99 3      38.8725 (   0.00%)     39.2225 (  -0.90%)
> BAmean-99 5      41.1898 (   0.00%)     41.0037 (   0.45%)
> BAmean-99 7      51.8645 (   0.00%)     51.4453 (   0.81%)
> BAmean-99 12     63.6317 (   0.00%)     63.9307 (  -0.47%)
> BAmean-99 16     74.4528 (   0.00%)     75.2113 (  -1.02%)
> 
> [schbench]
> Wakeup Latencies 99.0th improvement is observed.
> 
> threads          baseline             sched_cache          change
> 1                13.80(1.10)          14.80(2.86)          -7.25%
> 2                12.00(1.00)          8.00(2.12)           +33.33%
> 4                9.00(0.00)           5.60(0.89)           +37.78%
> 8                9.00(0.00)           6.40(1.14)           +28.89%
> 16               9.20(0.45)           6.20(0.84)           +32.61%
> 32               9.60(0.55)           7.00(0.71)           +27.08%
> 64               10.80(0.45)          8.40(0.55)           +22.22%
> 128              12.60(0.55)          11.40(0.55)          +9.52%
> 239              14.00(0.00)          14.20(0.45)          -1.43%
> 
> [stream]
> No much difference is observed.
>                              baseline                     sc
> GB/sec copy-2        35.00 (   0.00%)       34.79 (  -0.60%)
> GB/sec scale-2       24.04 (   0.00%)       23.90 (  -0.58%)
> GB/sec add-2         28.98 (   0.00%)       28.92 (  -0.22%)
> GB/sec triad-2       28.32 (   0.00%)       28.31 (  -0.04%)
> 
> [netperf]
> No much difference is observed(consider the stdev).
> 
>          nr_pairs          netperf                netperf
> 
> Hmean     60      1023.44 (   0.00%)     1021.87 (  -0.15%)
> BHmean-99 60      1023.78 (   0.00%)     1022.22 (  -0.15%)
> Hmean     120      792.09 (   0.00%)      793.75 (   0.21%)
> BHmean-99 120      792.36 (   0.00%)      794.04 (   0.21%)
> Hmean     180      513.42 (   0.00%)      513.53 (   0.02%)
> BHmean-99 180      513.81 (   0.00%)      513.80 (  -0.00%)
> Hmean     240      387.09 (   0.00%)      387.33 (   0.06%)
> BHmean-99 240      387.18 (   0.00%)      387.45 (   0.07%)
> Hmean     300      316.04 (   0.00%)      315.68 (  -0.12%)
> BHmean-99 300      316.12 (   0.00%)      315.77 (  -0.11%)
> Hmean     360      496.38 (   0.00%)      455.49 (  -8.24%)
> BHmean-99 360      499.88 (   0.00%)      458.17 (  -8.34%)
> Hmean     420      497.32 (   0.00%)      501.84 (   0.91%)
> BHmean-99 420      499.90 (   0.00%)      504.56 (   0.93%)
> Hmean     480      417.62 (   0.00%)      432.25 (   3.50%)
> BHmean-99 480      419.96 (   0.00%)      434.43 (   3.45%)
> 
> In above case of 360 pairs, although there is a performance
> drop of 8.24%, the corresponding:
> HCoeffVar   360    23.78 (   0.00%)       29.52 ( -24.15%)
> shows that the regression is within the run-to-run variance.
> 
> [Milan details]
> 
> default settings:
> [hackbench]
> 
> Min       1      50.8170 (   0.00%)     51.1890 (  -0.73%)
> Min       3      59.3610 (   0.00%)     58.6080 (   1.27%)
> Min       5      94.9760 (   0.00%)     96.0210 (  -1.10%)
> Min       7     123.3270 (   0.00%)    124.1680 (  -0.68%)
> Min       12    179.2000 (   0.00%)    181.8390 (  -1.47%)
> Min       16    238.8680 (   0.00%)    242.6390 (  -1.58%)
> Amean     1      51.6614 (   0.00%)     51.3630 (   0.58%)
> Amean     3      60.1886 (   0.00%)     59.4542 (   1.22%)
> Amean     5      95.7602 (   0.00%)     96.8338 (  -1.12%)
> Amean     7     124.0332 (   0.00%)    124.4406 (  -0.33%)
> Amean     12    181.0324 (   0.00%)    182.9220 (  -1.04%)
> Amean     16    239.5556 (   0.00%)    243.3556 *  -1.59%*
> BAmean-99 1      51.5335 (   0.00%)     51.3338 (   0.39%)
> BAmean-99 3      59.7848 (   0.00%)     59.0958 (   1.15%)
> BAmean-99 5      95.6698 (   0.00%)     96.5450 (  -0.91%)
> BAmean-99 7     123.8478 (   0.00%)    124.3760 (  -0.43%)
> BAmean-99 12    180.8035 (   0.00%)    182.5135 (  -0.95%)
> BAmean-99 16    239.1933 (   0.00%)    243.0570 (  -1.62%)
> 
> [schbench]
> 
> threads          baseline             sched_cache          change
> 1                12.00(2.00)          11.00(0.71)          +8.33%
> 2                12.40(0.89)          13.80(0.84)          -11.29%
> 4                14.20(0.45)          14.80(0.45)          -4.23%
> 8                16.00(0.00)          15.80(0.45)          +1.25%
> 16               16.00(0.00)          16.00(0.71)          0.00%
> 32               19.40(0.55)          18.60(0.55)          +4.12%
> 63               22.20(0.45)          23.20(0.45)          -4.50%
> 
> [stream]
> No obvious difference is found.
> export STREAM_SIZE=$((128000000))
> 
>                      baseline               sched_cache
> GB/sec copy-16       726.48 (   0.00%)      715.60 (  -1.50%)
> GB/sec scale-16      577.71 (   0.00%)      577.03 (  -0.12%)
> GB/sec add-16        678.85 (   0.00%)      672.87 (  -0.88%)
> GB/sec triad-16      735.52 (   0.00%)      729.05 (  -0.88%)
> 
> 
> [netperf]
> No much difference is observed.
> 
>          nr_pairs          baseline           sched_cache
> Hmean     32       755.98 (   0.00%)      755.17 (  -0.11%)
> BHmean-99 32       756.42 (   0.00%)      755.40 (  -0.13%)
> Hmean     64       677.38 (   0.00%)      669.75 (  -1.13%)
> BHmean-99 64       677.50 (   0.00%)      669.86 (  -1.13%)
> Hmean     96       498.52 (   0.00%)      496.73 (  -0.36%)
> BHmean-99 96       498.69 (   0.00%)      496.93 (  -0.35%)
> Hmean     128      604.38 (   0.00%)      604.22 (  -0.03%)
> BHmean-99 128      604.87 (   0.00%)      604.87 (   0.00%)
> Hmean     160      471.67 (   0.00%)      468.29 (  -0.72%)
> BHmean-99 160      474.34 (   0.00%)      471.05 (  -0.69%)
> Hmean     192      381.18 (   0.00%)      384.88 (   0.97%)
> BHmean-99 192      383.30 (   0.00%)      386.82 (   0.92%)
> Hmean     224      327.79 (   0.00%)      326.05 (  -0.53%)
> BHmean-99 224      329.85 (   0.00%)      327.87 (  -0.60%)
> Hmean     256      284.61 (   0.00%)      300.52 (   5.59%)
> BHmean-99 256      286.41 (   0.00%)      302.06 (   5.47%)
> 
> [Genoa details]
> [ChaCha20-xiangshan]
> ChaCha20-xiangshan is a simple benchmark using a static build of an
> 8-thread Verilator of XiangShan(RISC-V). The README file can be
> found here[2]. The score depends on how aggressive the user set the
> /sys/kernel/debug/sched/llc_aggr_tolerance. Using the default values,
> there is no much difference observed. While setting the
> /sys/kernel/debug/sched/llc_aggr_tolerance to 100, 44% improvment is
> observed.
> 
> baseline:
> Host time spent: 50,868ms
> 
> sched_cache:
> Host time spent: 28,349ms
> 
> The time has been reduced by 44%.

Milan showed no improvement across all benchmarks, which could be due to
the CCX topology (8 CCXs × 8 CPUs), where each LLC domain may be too small
for this optimization to be effective. Moreover, there could be overhead
from the additional computations.

The ChaCha20-xiangshan improvement on Genoa when llc_aggr_tolerance is set
to 100 seems to come from its relatively low thread count. Please provide
the numbers with the default values too; I would like to see the numbers
under varying loads.

On Power10 and Power11, the LLC spans only 4 threads, which is even
smaller. I'm not expecting improvements here, but I will run some
workloads and share the data.

I have not gone through the entire series yet, but consider a two-node
NUMA system where a task's preferred LLC ends up on the wrong NUMA node
for its memory: which preference takes precedence?

Also, what about workloads that don't share data, like stress-ng? It would
be good to make sure that most other workloads don't suffer. As mentioned,
a per-process knob for llc_aggr_tolerance could help.

Thanks,
Madadi Vineeth Reddy

> 
> Thanks to everyone who participated and provided valuable suggestions for
> the previous versions. Your comments and tests on the latest version are
> also greatly appreciated in advance.
> 
> Tim
> 
> [1] https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> 
> [2] https://github.com/yu-chen-surf/chacha20-xiangshan/blob/master/README.eng.md
> 
> RFC v4:
> [3] https://lore.kernel.org/all/cover.1754712565.git.tim.c.chen@linux.intel.com/
> 
> RFC v3
> [4] https://lore.kernel.org/all/cover.1750268218.git.tim.c.chen@linux.intel.com/
> 
> RFC v2:
> [5] https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
> 
> 
> Chen Yu (7):
>   sched/fair: Record per-LLC utilization to guide cache-aware scheduling
>     decisions
>   sched/fair: Introduce helper functions to enforce LLC migration policy
>   sched/fair: Introduce a static key to enable cache aware only for
>     multi LLCs
>   sched/fair: Exclude processes with many threads from cache-aware
>     scheduling
>   sched/fair: Disable cache aware scheduling for processes with high
>     thread counts
>   sched/fair: Avoid cache-aware scheduling for memory-heavy processes
>   sched/fair: Add user control to adjust the tolerance of cache-aware
>     scheduling
> 
> Peter Zijlstra (Intel) (1):
>   sched/fair: Add infrastructure for cache-aware load balancing
> 
> Tim Chen (11):
>   sched/fair: Add LLC index mapping for CPUs
>   sched/fair: Assign preferred LLC ID to processes
>   sched/fair: Track LLC-preferred tasks per runqueue
>   sched/fair: Introduce per runqueue task LLC preference counter
>   sched/fair: Count tasks prefering each LLC in a sched group
>   sched/fair: Prioritize tasks preferring destination LLC during
>     balancing
>   sched/fair: Identify busiest sched_group for LLC-aware load balancing
>   sched/fair: Add migrate_llc_task migration type for cache-aware
>     balancing
>   sched/fair: Handle moving single tasks to/from their preferred LLC
>   sched/fair: Consider LLC preference when selecting tasks for load
>     balancing
>   sched/fair: Respect LLC preference in task migration and detach
> 
>  include/linux/cacheinfo.h      |   21 +-
>  include/linux/mm_types.h       |   45 ++
>  include/linux/sched.h          |    5 +
>  include/linux/sched/topology.h |    4 +
>  include/linux/threads.h        |   10 +
>  init/Kconfig                   |   20 +
>  init/init_task.c               |    3 +
>  kernel/fork.c                  |    6 +
>  kernel/sched/core.c            |   18 +
>  kernel/sched/debug.c           |   56 ++
>  kernel/sched/fair.c            | 1022 +++++++++++++++++++++++++++++++-
>  kernel/sched/features.h        |    1 +
>  kernel/sched/sched.h           |   27 +
>  kernel/sched/topology.c        |   61 +-
>  14 files changed, 1283 insertions(+), 16 deletions(-)
> 



* Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-11 18:24 ` [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing Tim Chen
@ 2025-10-14 19:12   ` Madadi Vineeth Reddy
  2025-10-15  4:54     ` Chen, Yu C
  2025-10-15 11:54     ` Peter Zijlstra
  2025-10-23  7:26   ` kernel test robot
  2025-10-27  4:47   ` K Prateek Nayak
  2 siblings, 2 replies; 116+ messages in thread
From: Madadi Vineeth Reddy @ 2025-10-14 19:12 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu, Libo Chen,
	Adam Li, Tim Chen, linux-kernel, Madadi Vineeth Reddy

On 11/10/25 23:54, Tim Chen wrote:
> From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> 
> Cache-aware load balancing aims to aggregate tasks with potential
> shared resources into the same cache domain. This approach enhances
> cache locality, thereby optimizing system performance by reducing
> cache misses and improving data access efficiency.
> 

[snip]

> +static void get_scan_cpumasks(cpumask_var_t cpus, int cache_cpu,
> +			      int pref_nid, int curr_cpu)
> +{
> +#ifdef CONFIG_NUMA_BALANCING
> +	/* First honor the task's preferred node. */
> +	if (pref_nid != NUMA_NO_NODE)
> +		cpumask_or(cpus, cpus, cpumask_of_node(pref_nid));
> +#endif
> +
> +	/* Next honor the task's cache CPU if it is not included. */
> +	if (cache_cpu != -1 && !cpumask_test_cpu(cache_cpu, cpus))
> +		cpumask_or(cpus, cpus,
> +			   cpumask_of_node(cpu_to_node(cache_cpu)));
> +
> +	/*
> +	 * Lastly make sure that the task's current running node is
> +	 * considered.
> +	 */
> +	if (!cpumask_test_cpu(curr_cpu, cpus))
> +		cpumask_or(cpus, cpus, cpumask_of_node(cpu_to_node(curr_cpu)));
> +}
> +
> +static void __no_profile task_cache_work(struct callback_head *work)
> +{
> +	struct task_struct *p = current;
> +	struct mm_struct *mm = p->mm;
> +	unsigned long m_a_occ = 0;
> +	unsigned long curr_m_a_occ = 0;
> +	int cpu, m_a_cpu = -1, cache_cpu,
> +	    pref_nid = NUMA_NO_NODE, curr_cpu;
> +	cpumask_var_t cpus;
> +
> +	WARN_ON_ONCE(work != &p->cache_work);
> +
> +	work->next = work;
> +
> +	if (p->flags & PF_EXITING)
> +		return;
> +
> +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
> +		return;
> +
> +	curr_cpu = task_cpu(p);
> +	cache_cpu = mm->mm_sched_cpu;
> +#ifdef CONFIG_NUMA_BALANCING
> +	if (static_branch_likely(&sched_numa_balancing))
> +		pref_nid = p->numa_preferred_nid;
> +#endif
> +
> +	scoped_guard (cpus_read_lock) {
> +		get_scan_cpumasks(cpus, cache_cpu,
> +				  pref_nid, curr_cpu);
> +

IIUC, `get_scan_cpumasks` ORs together the preferred NUMA node, the cache
CPU's node, and the current CPU's node. This could result in scanning
multiple nodes rather than preferring the NUMA preferred node.

> +		for_each_cpu(cpu, cpus) {
> +			/* XXX sched_cluster_active */
> +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
> +			unsigned long occ, m_occ = 0, a_occ = 0;
> +			int m_cpu = -1, i;
> +
> +			if (!sd)
> +				continue;
> +
> +			for_each_cpu(i, sched_domain_span(sd)) {
> +				occ = fraction_mm_sched(cpu_rq(i),
> +							per_cpu_ptr(mm->pcpu_sched, i));
> +				a_occ += occ;
> +				if (occ > m_occ) {
> +					m_occ = occ;
> +					m_cpu = i;
> +				}
> +			}
> +
> +			/*
> +			 * Compare the accumulated occupancy of each LLC. The
> +			 * reason for using accumulated occupancy rather than average
> +			 * per CPU occupancy is that it works better in asymmetric LLC
> +			 * scenarios.
> +			 * For example, if there are 2 threads in a 4CPU LLC and 3
> +			 * threads in an 8CPU LLC, it might be better to choose the one
> +			 * with 3 threads. However, this would not be the case if the
> +			 * occupancy is divided by the number of CPUs in an LLC (i.e.,
> +			 * if average per CPU occupancy is used).
> +			 * Besides, NUMA balancing fault statistics behave similarly:
> +			 * the total number of faults per node is compared rather than
> +			 * the average number of faults per CPU. This strategy is also
> +			 * followed here.
> +			 */
> +			if (a_occ > m_a_occ) {
> +				m_a_occ = a_occ;
> +				m_a_cpu = m_cpu;
> +			}
> +
> +			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
> +				curr_m_a_occ = a_occ;
> +
> +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
> +		}

This means NUMA preference has no effect on the selection, except in the
unlikely case of exactly equal occupancy across LLCs on different nodes
(where iteration order determines the winner).

How is the case handled where cache locality and memory locality
conflict? Shouldn't the NUMA preferred node get preference? Also,
scanning multiple nodes adds overhead, so would it be better to
restrict the scan to the NUMA preferred node and only scan other
nodes when there is no preferred node?

Let me know if I am missing anything.

Thanks,
Madadi Vineeth Reddy


> +	}
> +
> +	if (m_a_occ > (2 * curr_m_a_occ)) {
> +		/*
> +		 * Avoid switching mm_sched_cpu too fast.
> +		 * The reason to choose 2X is because:
> +		 * 1. It is better to keep the preferred LLC stable,
> +		 *    rather than changing it frequently and cause migrations
> +		 * 2. 2X means the new preferred LLC has at least 1 more
> +		 *    busy CPU than the old one(200% vs 100%, eg)
> +		 * 3. 2X is chosen based on test results, as it delivers
> +		 *    the optimal performance gain so far.
> +		 */
> +		mm->mm_sched_cpu = m_a_cpu;
> +	}
> +
> +	free_cpumask_var(cpus);
> +}
> +
> +void init_sched_mm(struct task_struct *p)
> +{
> +	struct callback_head *work = &p->cache_work;
> +
> +	init_task_work(work, task_cache_work);
> +	work->next = work;
> +}
> +
> +#else
> +
> +static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
> +				    s64 delta_exec) { }
> +
> +void init_sched_mm(struct task_struct *p) { }
> +
> +static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
> +
> +#endif
> +
>  /*
>   * Used by other classes to account runtime.
>   */
> @@ -13031,6 +13317,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  	if (static_branch_unlikely(&sched_numa_balancing))
>  		task_tick_numa(rq, curr);
>  
> +	task_tick_cache(rq, curr);
> +
>  	update_misfit_status(curr, rq);
>  	check_update_overutilized_status(task_rq(curr));
>  
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 3c12d9f93331..d2af7bfd36bf 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -87,6 +87,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_UTIL, true)
>  
> +SCHED_FEAT(SCHED_CACHE, true)
>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls
>   * in a single rq->lock section. Default disabled because the
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index be9745d104f7..2ded8d3d0ecc 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1166,6 +1166,12 @@ struct rq {
>  	u64			clock_pelt_idle_copy;
>  	u64			clock_idle_copy;
>  #endif
> +#ifdef CONFIG_SCHED_CACHE
> +	raw_spinlock_t		cpu_epoch_lock ____cacheline_aligned;
> +	u64			cpu_runtime;
> +	unsigned long		cpu_epoch;
> +	unsigned long		cpu_epoch_next;
> +#endif
>  
>  	atomic_t		nr_iowait;
>  
> @@ -3790,6 +3796,8 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
>  static inline void init_sched_mm_cid(struct task_struct *t) { }
>  #endif /* !CONFIG_SCHED_MM_CID */
>  
> +extern void init_sched_mm(struct task_struct *p);
> +
>  extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
>  extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
>  static inline



* Re: [PATCH 00/19] Cache Aware Scheduling
  2025-10-14 12:13 ` [PATCH 00/19] Cache Aware Scheduling Madadi Vineeth Reddy
@ 2025-10-14 21:48   ` Tim Chen
  2025-10-15  5:38     ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-14 21:48 UTC (permalink / raw)
  To: Madadi Vineeth Reddy
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu, Libo Chen,
	Adam Li, Tim Chen, linux-kernel

On Tue, 2025-10-14 at 17:43 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
> Thanks for the patch.
> 
> On 11/10/25 23:54, Tim Chen wrote:
> > There had been 4 RFC postings of this patch set. We've incorporated
> > the feedbacks and comments and now would like to post this patch set
> > for consideration of inclusion to mainline. The patches are based on
> > the original patch proposed by Peter[1].
> > 
> 
> [snip]
> 
> > The following tunables control under /sys/kernel/debug/sched/ control
> > the behavior of cache aware scheduling:
> > 
> > 1. llc_aggr_tolerance Controls how aggressive we aggregate tasks to
> > their preferred LLC, based on a process's RSS size and number of running
> > threads.  Processes that have smaller memory footprint and fewer number
> > of tasks will benefit better from aggregation.  Varies between 0 to 100
> >         0:  Cache aware scheduling is disabled 1:  Process with RSS
> >         greater than LLC size,
> > 	    or running threads more than number of cpu cores/LLC skip
> > 	    aggregation
> > 	100:  Aggressive; a process's threads are aggregated regardless of
> > 	      RSS or running threads.
> > For example, with a 32MB L3 cache 8 cores in L3:
> >     llc_aggr_tolerance=1 -> process with RSS > 32MB, or nr_running_avg >
> >     8 are skipped.  llc_aggr_tolerance=99 -> process with RSS > 784GB
> >     or nr_running_avg > 785 are skipped.  784GB = (1 + (99 - 1) * 256)
> >     * 32MB.
> >      785  = (1 + (99 - 1) * 8).
> > 
> > Currently this knob is a global control. Considering that different workloads have
> > different requirements for task consolidation, it would be ideal to introduce
> > per process control for this knob via prctl in the future.
> >  
> > 2. llc_overload_pct, llc_imb_pct
> > We'll always try to move a task to its preferred LLC if an LLC's average core
> > utilization is below llc_overload_pct (default to 50%). Otherwise, the utilization
> > of preferred LLC has to be not more than llc_imb_pct (default to 20%) to move a task
> > to it. This is to prevent overloading on the preferred LLC.
> >  
> > 3. llc_epoch_period
> > Controls how often the scheduler collect LLC occupancy of a process (default to 10 msec)
> >  
> > 4. llc_epoch_affinity_timeout
> > Detect that if a process has not run for llc_epoch_affinity_timeout (default to 50 msec),
> > it loses its cache preference.
> 
> How are these default values arrived at? Is it based on some theory or
> based on the results of the runs?

Right now the default value of llc_aggr_tolerance is fairly conservative.
We make sure that we don't cause regressions to the workloads we tested.

Knobs like llc_overload_pct and llc_imb_pct were chosen from Len's Yogini
micro-benchmark experiments, which gave good aggregation without
overloading the LLC.

llc_epoch_period and llc_epoch_affinity_timeout come from Peter's
original patch; they seem to work fairly well, so we left them as is.

> 
> > 
> > Test results:
> > The first test platform is a 2 socket Intel Sapphire Rapids with 30
> > cores per socket. The DRAM interleaving is enabled in the BIOS so it
> > essential has one NUMA node with two last level caches. There are 60
> > CPUs associated with each last level cache.
> > 
> > The second test platform is a AMD Milan. There are 2 Nodes and 64 CPUs
> > per node. Each node has 8 CCXs and each CCX has 8 CPUs.
> > 
> > The third test platform is a AMD Genoa. There are 4 Nodes and 32 CPUs per node.
> > Each node has 2 CCXs and each CCX has 16 CPUs.
> > 
> > [TL;DR]
> > Sappire Rapids:
> > hackbench shows significant improvement when there is 1 group
> > with different number of fd pairs(threads) within this process.
> > schbench shows overall wakeup latency improvement.
> > ChaCha20-xiangshan shows ~10% throughput improvement. Other
> > micro-workloads did not show much difference.
> > 
> > Milan:
> > No obvious difference is observed so far.
> > 
> > Genoa:
> > ChaCha20-xiangshan shows 44% throughput improvement.
> > 
> > [Sapphire Rapids details]
> > 
> > [hackbench]
> > Hackbench show overall improvement when there is only 1
> > group, with different number of fd(pairs). This is the
> > expected behavior because this test scenario would benefit
> > from cache aware load balance most. Other number of groups
> > shows not much difference(using default fd = 20).
> > 
> >        groups              baseline            sched_cache
> > Min       1      37.5960 (   0.00%)     26.4340 (  29.69%)
> > Min       3      38.7050 (   0.00%)     38.6920 (   0.03%)
> > Min       5      39.4550 (   0.00%)     38.6280 (   2.10%)
> > Min       7      51.4270 (   0.00%)     50.6790 (   1.45%)
> > Min       12     62.8540 (   0.00%)     63.6590 (  -1.28%)
> > Min       16     74.0160 (   0.00%)     74.7480 (  -0.99%)
> > Amean     1      38.4768 (   0.00%)     26.7146 *  30.57%*
> > Amean     3      39.0750 (   0.00%)     39.5586 (  -1.24%)
> > Amean     5      41.5178 (   0.00%)     41.2766 (   0.58%)
> > Amean     7      52.1164 (   0.00%)     51.5152 (   1.15%)
> > Amean     12     63.9052 (   0.00%)     64.0420 (  -0.21%)
> > Amean     16     74.5812 (   0.00%)     75.4318 (  -1.14%)
> > BAmean-99 1      38.2027 (   0.00%)     26.5500 (  30.50%)
> > BAmean-99 3      38.8725 (   0.00%)     39.2225 (  -0.90%)
> > BAmean-99 5      41.1898 (   0.00%)     41.0037 (   0.45%)
> > BAmean-99 7      51.8645 (   0.00%)     51.4453 (   0.81%)
> > BAmean-99 12     63.6317 (   0.00%)     63.9307 (  -0.47%)
> > BAmean-99 16     74.4528 (   0.00%)     75.2113 (  -1.02%)
> > 
> > [schbench]
> > Wakeup Latencies 99.0th improvement is observed.
> > 
> > threads          baseline             sched_cache          change
> > 1                13.80(1.10)          14.80(2.86)          -7.25%
> > 2                12.00(1.00)          8.00(2.12)           +33.33%
> > 4                9.00(0.00)           5.60(0.89)           +37.78%
> > 8                9.00(0.00)           6.40(1.14)           +28.89%
> > 16               9.20(0.45)           6.20(0.84)           +32.61%
> > 32               9.60(0.55)           7.00(0.71)           +27.08%
> > 64               10.80(0.45)          8.40(0.55)           +22.22%
> > 128              12.60(0.55)          11.40(0.55)          +9.52%
> > 239              14.00(0.00)          14.20(0.45)          -1.43%
> > 
> > [stream]
> > No much difference is observed.
> >                              baseline                     sc
> > GB/sec copy-2        35.00 (   0.00%)       34.79 (  -0.60%)
> > GB/sec scale-2       24.04 (   0.00%)       23.90 (  -0.58%)
> > GB/sec add-2         28.98 (   0.00%)       28.92 (  -0.22%)
> > GB/sec triad-2       28.32 (   0.00%)       28.31 (  -0.04%)
> > 
> > [netperf]
> > No much difference is observed(consider the stdev).
> > 
> >          nr_pairs          netperf                netperf
> > 
> > Hmean     60      1023.44 (   0.00%)     1021.87 (  -0.15%)
> > BHmean-99 60      1023.78 (   0.00%)     1022.22 (  -0.15%)
> > Hmean     120      792.09 (   0.00%)      793.75 (   0.21%)
> > BHmean-99 120      792.36 (   0.00%)      794.04 (   0.21%)
> > Hmean     180      513.42 (   0.00%)      513.53 (   0.02%)
> > BHmean-99 180      513.81 (   0.00%)      513.80 (  -0.00%)
> > Hmean     240      387.09 (   0.00%)      387.33 (   0.06%)
> > BHmean-99 240      387.18 (   0.00%)      387.45 (   0.07%)
> > Hmean     300      316.04 (   0.00%)      315.68 (  -0.12%)
> > BHmean-99 300      316.12 (   0.00%)      315.77 (  -0.11%)
> > Hmean     360      496.38 (   0.00%)      455.49 (  -8.24%)
> > BHmean-99 360      499.88 (   0.00%)      458.17 (  -8.34%)
> > Hmean     420      497.32 (   0.00%)      501.84 (   0.91%)
> > BHmean-99 420      499.90 (   0.00%)      504.56 (   0.93%)
> > Hmean     480      417.62 (   0.00%)      432.25 (   3.50%)
> > BHmean-99 480      419.96 (   0.00%)      434.43 (   3.45%)
> > 
> > In above case of 360 pairs, although there is a performance
> > drop of 8.24%, the corresponding:
> > HCoeffVar   360    23.78 (   0.00%)       29.52 ( -24.15%)
> > shows that the regression is within the run-to-run variance.
> > 
> > [Milan details]
> > 
> > default settings:
> > [hackbench]
> > 
> > Min       1      50.8170 (   0.00%)     51.1890 (  -0.73%)
> > Min       3      59.3610 (   0.00%)     58.6080 (   1.27%)
> > Min       5      94.9760 (   0.00%)     96.0210 (  -1.10%)
> > Min       7     123.3270 (   0.00%)    124.1680 (  -0.68%)
> > Min       12    179.2000 (   0.00%)    181.8390 (  -1.47%)
> > Min       16    238.8680 (   0.00%)    242.6390 (  -1.58%)
> > Amean     1      51.6614 (   0.00%)     51.3630 (   0.58%)
> > Amean     3      60.1886 (   0.00%)     59.4542 (   1.22%)
> > Amean     5      95.7602 (   0.00%)     96.8338 (  -1.12%)
> > Amean     7     124.0332 (   0.00%)    124.4406 (  -0.33%)
> > Amean     12    181.0324 (   0.00%)    182.9220 (  -1.04%)
> > Amean     16    239.5556 (   0.00%)    243.3556 *  -1.59%*
> > BAmean-99 1      51.5335 (   0.00%)     51.3338 (   0.39%)
> > BAmean-99 3      59.7848 (   0.00%)     59.0958 (   1.15%)
> > BAmean-99 5      95.6698 (   0.00%)     96.5450 (  -0.91%)
> > BAmean-99 7     123.8478 (   0.00%)    124.3760 (  -0.43%)
> > BAmean-99 12    180.8035 (   0.00%)    182.5135 (  -0.95%)
> > BAmean-99 16    239.1933 (   0.00%)    243.0570 (  -1.62%)
> > 
> > [schbench]
> > 
> > threads          baseline             sched_cache          change
> > 1                12.00(2.00)          11.00(0.71)          +8.33%
> > 2                12.40(0.89)          13.80(0.84)          -11.29%
> > 4                14.20(0.45)          14.80(0.45)          -4.23%
> > 8                16.00(0.00)          15.80(0.45)          +1.25%
> > 16               16.00(0.00)          16.00(0.71)          0.00%
> > 32               19.40(0.55)          18.60(0.55)          +4.12%
> > 63               22.20(0.45)          23.20(0.45)          -4.50%
> > 
> > [stream]
> > No obvious difference is found.
> > export STREAM_SIZE=$((128000000))
> > 
> >                      baseline               sched_cache
> > GB/sec copy-16       726.48 (   0.00%)      715.60 (  -1.50%)
> > GB/sec scale-16      577.71 (   0.00%)      577.03 (  -0.12%)
> > GB/sec add-16        678.85 (   0.00%)      672.87 (  -0.88%)
> > GB/sec triad-16      735.52 (   0.00%)      729.05 (  -0.88%)
> > 
> > 
> > [netperf]
> > No much difference is observed.
> > 
> >          nr_pairs          baseline           sched_cache
> > Hmean     32       755.98 (   0.00%)      755.17 (  -0.11%)
> > BHmean-99 32       756.42 (   0.00%)      755.40 (  -0.13%)
> > Hmean     64       677.38 (   0.00%)      669.75 (  -1.13%)
> > BHmean-99 64       677.50 (   0.00%)      669.86 (  -1.13%)
> > Hmean     96       498.52 (   0.00%)      496.73 (  -0.36%)
> > BHmean-99 96       498.69 (   0.00%)      496.93 (  -0.35%)
> > Hmean     128      604.38 (   0.00%)      604.22 (  -0.03%)
> > BHmean-99 128      604.87 (   0.00%)      604.87 (   0.00%)
> > Hmean     160      471.67 (   0.00%)      468.29 (  -0.72%)
> > BHmean-99 160      474.34 (   0.00%)      471.05 (  -0.69%)
> > Hmean     192      381.18 (   0.00%)      384.88 (   0.97%)
> > BHmean-99 192      383.30 (   0.00%)      386.82 (   0.92%)
> > Hmean     224      327.79 (   0.00%)      326.05 (  -0.53%)
> > BHmean-99 224      329.85 (   0.00%)      327.87 (  -0.60%)
> > Hmean     256      284.61 (   0.00%)      300.52 (   5.59%)
> > BHmean-99 256      286.41 (   0.00%)      302.06 (   5.47%)
> > 
> > [Genoa details]
> > [ChaCha20-xiangshan]
> > ChaCha20-xiangshan is a simple benchmark using a static build of an
> > 8-thread Verilator of XiangShan(RISC-V). The README file can be
> > found here[2]. The score depends on how aggressive the user set the
> > /sys/kernel/debug/sched/llc_aggr_tolerance. Using the default values,
> > there is no much difference observed. While setting the
> > /sys/kernel/debug/sched/llc_aggr_tolerance to 100, 44% improvment is
> > observed.
> > 
> > baseline:
> > Host time spent: 50,868ms
> > 
> > sched_cache:
> > Host time spent: 28,349ms
> > 
> > The time has been reduced by 44%.
> 
> Milan showed no improvement across all benchmarks, which could be due to the 
> CCX topology (8 CCXs × 8 CPUs) where the LLC domain is too small for this
> optimization to be effective. Moreover there could be overhead due to additional
> computations.
> 
> ChaCha20-xiangshan improvement in Genoa when llc_aggr_tolerance is set to 100 seems
> due to having relatively lesser thread count. Please provide the numbers
> with default values too. Would like to know numbers on varying loads.

I'll ask Chen Yu, who did the Xiangshan experiments, if he has those numbers.

> 
> In Power 10 and Power 11, the LLC size is 4 threads which is even smaller. Not
> expecting improvements here but will run some workloads and share the data.
> 
> Not gone through the entire series yet but are the situations like say in two
> NUMA system, if a task's preferred LLC is on the wrong NUMA node for its memory,
> which takes precedence? 

We take the preferred NUMA node into consideration, but we do not force a
task to go to the preferred node.

I remember that initially we limited the consideration to only the LLCs in
the preferred node. But we encountered regressions in hackbench and
schbench, because when the preferred node had no occupancy, the preferred
LLC was set to -1 (no preference), which resulted in extra task migrations.
Also, the preferred node for hackbench and schbench was volatile, as they
have small memory footprints.  Chen Yu, please chime in if there were
other reasons you remember.

We'll need to revisit this part of the code to take care of such corner
cases. Ideally, I think we should move tasks to the least loaded LLC in
the preferred node (even if no LLC in the preferred node has any
occupancy), as long as the preferred NUMA node doesn't change too often.
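
Roughly, I am thinking of something along the lines of the sketch below.
This is purely illustrative, not code from this series; llc_load_of() is
a made-up helper that would return some load/occupancy metric for the
LLC containing a CPU:

	static int least_loaded_llc_cpu(int pref_nid)
	{
		unsigned long load, best_load = ULONG_MAX;
		int cpu, best_cpu = -1;

		/*
		 * Walk all CPUs of the preferred node. CPUs in the same
		 * LLC return the same value, so this is redundant work,
		 * but it keeps the sketch simple.
		 */
		for_each_cpu(cpu, cpumask_of_node(pref_nid)) {
			load = llc_load_of(cpu);	/* hypothetical */
			if (load < best_load) {
				best_load = load;
				best_cpu = cpu;
			}
		}

		/* candidate for mm->mm_sched_cpu within the preferred node */
		return best_cpu;
	}

Whether we actually adopt that candidate would still depend on the
preferred NUMA node staying stable enough over time.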


> 
> Also, what about the workloads that don't share data like stress-ng? 
> 

We can test those.  Ideally, the controls that prevent over-aggregation
on the preferred LLC should keep stress-ng happy.

> It will
> be good to make sure that most other workloads don't suffer. As mentioned,
> per process knob for llc_aggr_tolerance could help.

Agreed. We are planning to add a per-process knob in the next version.
One thought is to use prctl. Any other suggestions are welcome.

Tim

> 
> Thanks,
> Madadi Vineeth Reddy
> 
> > 
> > Thanks to everyone who participated and provided valuable suggestions for
> > the previous versions. Your comments and tests on the latest version are
> > also greatly appreciated in advance.
> > 
> > Tim
> > 
> > [1] https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> > 
> > [2] https://github.com/yu-chen-surf/chacha20-xiangshan/blob/master/README.eng.md
> > 
> > RFC v4:
> > [3] https://lore.kernel.org/all/cover.1754712565.git.tim.c.chen@linux.intel.com/
> > 
> > RFC v3
> > [4] https://lore.kernel.org/all/cover.1750268218.git.tim.c.chen@linux.intel.com/
> > 
> > RFC v2:
> > [5] https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
> > 
> > 
> > Chen Yu (7):
> >   sched/fair: Record per-LLC utilization to guide cache-aware scheduling
> >     decisions
> >   sched/fair: Introduce helper functions to enforce LLC migration policy
> >   sched/fair: Introduce a static key to enable cache aware only for
> >     multi LLCs
> >   sched/fair: Exclude processes with many threads from cache-aware
> >     scheduling
> >   sched/fair: Disable cache aware scheduling for processes with high
> >     thread counts
> >   sched/fair: Avoid cache-aware scheduling for memory-heavy processes
> >   sched/fair: Add user control to adjust the tolerance of cache-aware
> >     scheduling
> > 
> > Peter Zijlstra (Intel) (1):
> >   sched/fair: Add infrastructure for cache-aware load balancing
> > 
> > Tim Chen (11):
> >   sched/fair: Add LLC index mapping for CPUs
> >   sched/fair: Assign preferred LLC ID to processes
> >   sched/fair: Track LLC-preferred tasks per runqueue
> >   sched/fair: Introduce per runqueue task LLC preference counter
> >   sched/fair: Count tasks prefering each LLC in a sched group
> >   sched/fair: Prioritize tasks preferring destination LLC during
> >     balancing
> >   sched/fair: Identify busiest sched_group for LLC-aware load balancing
> >   sched/fair: Add migrate_llc_task migration type for cache-aware
> >     balancing
> >   sched/fair: Handle moving single tasks to/from their preferred LLC
> >   sched/fair: Consider LLC preference when selecting tasks for load
> >     balancing
> >   sched/fair: Respect LLC preference in task migration and detach
> > 
> >  include/linux/cacheinfo.h      |   21 +-
> >  include/linux/mm_types.h       |   45 ++
> >  include/linux/sched.h          |    5 +
> >  include/linux/sched/topology.h |    4 +
> >  include/linux/threads.h        |   10 +
> >  init/Kconfig                   |   20 +
> >  init/init_task.c               |    3 +
> >  kernel/fork.c                  |    6 +
> >  kernel/sched/core.c            |   18 +
> >  kernel/sched/debug.c           |   56 ++
> >  kernel/sched/fair.c            | 1022 +++++++++++++++++++++++++++++++-
> >  kernel/sched/features.h        |    1 +
> >  kernel/sched/sched.h           |   27 +
> >  kernel/sched/topology.c        |   61 +-
> >  14 files changed, 1283 insertions(+), 16 deletions(-)
> > 
> 


* Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-14 19:12   ` Madadi Vineeth Reddy
@ 2025-10-15  4:54     ` Chen, Yu C
  2025-10-15 19:32       ` Tim Chen
  2025-10-15 11:54     ` Peter Zijlstra
  1 sibling, 1 reply; 116+ messages in thread
From: Chen, Yu C @ 2025-10-15  4:54 UTC (permalink / raw)
  To: Madadi Vineeth Reddy, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel, haoxing990

On 10/15/2025 3:12 AM, Madadi Vineeth Reddy wrote:
> On 11/10/25 23:54, Tim Chen wrote:
>> From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>>
>> Cache-aware load balancing aims to aggregate tasks with potential
>> shared resources into the same cache domain. This approach enhances
>> cache locality, thereby optimizing system performance by reducing
>> cache misses and improving data access efficiency.
>>

[snip]

>> +static void __no_profile task_cache_work(struct callback_head *work)
>> +{
>> +	struct task_struct *p = current;
>> +	struct mm_struct *mm = p->mm;
>> +	unsigned long m_a_occ = 0;
>> +	unsigned long curr_m_a_occ = 0;
>> +	int cpu, m_a_cpu = -1, cache_cpu,
>> +	    pref_nid = NUMA_NO_NODE, curr_cpu;
>> +	cpumask_var_t cpus;
>> +
>> +	WARN_ON_ONCE(work != &p->cache_work);
>> +
>> +	work->next = work;
>> +
>> +	if (p->flags & PF_EXITING)
>> +		return;
>> +
>> +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
>> +		return;
>> +
>> +	curr_cpu = task_cpu(p);
>> +	cache_cpu = mm->mm_sched_cpu;
>> +#ifdef CONFIG_NUMA_BALANCING
>> +	if (static_branch_likely(&sched_numa_balancing))
>> +		pref_nid = p->numa_preferred_nid;
>> +#endif
>> +
>> +	scoped_guard (cpus_read_lock) {
>> +		get_scan_cpumasks(cpus, cache_cpu,
>> +				  pref_nid, curr_cpu);
>> +
> 
> IIUC, `get_scan_cpumasks` ORs together the preferred NUMA node, cache CPU's node,
> and current CPU's node. This could result in scanning multiple nodes, not preferring
> the NUMA preferred node.
> 

Yes, it is possible; please see the comments below.

>> +		for_each_cpu(cpu, cpus) {
>> +			/* XXX sched_cluster_active */
>> +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
>> +			unsigned long occ, m_occ = 0, a_occ = 0;
>> +			int m_cpu = -1, i;
>> +
>> +			if (!sd)
>> +				continue;
>> +
>> +			for_each_cpu(i, sched_domain_span(sd)) {
>> +				occ = fraction_mm_sched(cpu_rq(i),
>> +							per_cpu_ptr(mm->pcpu_sched, i));
>> +				a_occ += occ;
>> +				if (occ > m_occ) {
>> +					m_occ = occ;
>> +					m_cpu = i;
>> +				}
>> +			}
>> +
>> +			/*
>> +			 * Compare the accumulated occupancy of each LLC. The
>> +			 * reason for using accumulated occupancy rather than average
>> +			 * per CPU occupancy is that it works better in asymmetric LLC
>> +			 * scenarios.
>> +			 * For example, if there are 2 threads in a 4CPU LLC and 3
>> +			 * threads in an 8CPU LLC, it might be better to choose the one
>> +			 * with 3 threads. However, this would not be the case if the
>> +			 * occupancy is divided by the number of CPUs in an LLC (i.e.,
>> +			 * if average per CPU occupancy is used).
>> +			 * Besides, NUMA balancing fault statistics behave similarly:
>> +			 * the total number of faults per node is compared rather than
>> +			 * the average number of faults per CPU. This strategy is also
>> +			 * followed here.
>> +			 */
>> +			if (a_occ > m_a_occ) {
>> +				m_a_occ = a_occ;
>> +				m_a_cpu = m_cpu;
>> +			}
>> +
>> +			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
>> +				curr_m_a_occ = a_occ;
>> +
>> +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
>> +		}
> 
> This means NUMA preference has no effect on the selection, except in the
> unlikely case of exactly equal occupancy across LLCs on different nodes
> (where iteration order determines the winner).
> 
> How does it handle when cache locality and memory locality conflict?
> Shouldn't numa preferred node get preference? Also scanning multiple
> nodes add overhead, so can restricting it to numa preferred node be
> better and scan others only when there is no numa preferred node?
> 

Basically, yes, you're right. Ideally, we should treat the NUMA
preferred node as the top priority. There's one case I find hard to
handle: the NUMA preferred node is per task rather than per process.
It's possible that different threads of the same process have different
preferred nodes; as a result, the process-wide preferred LLC could bounce
between different nodes, which might cause costly task migrations across
nodes. As a workaround, we tried to keep the scan CPU mask covering the
process's current preferred LLC to ensure the old preferred LLC is included
in the candidates. After all, we have a 2X threshold for switching the
preferred LLC.
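
Roughly, the switching rule looks like this (a simplified sketch, not
the exact code in the patch):

	/*
	 * Sketch: only adopt a new preferred LLC when its accumulated
	 * occupancy clearly beats the occupancy of the currently
	 * preferred LLC (2x hysteresis), so that the preference does
	 * not bounce between LLCs/nodes on small fluctuations.
	 */
	if (m_a_cpu != -1 && m_a_occ > 2 * curr_m_a_occ)
		mm->mm_sched_cpu = m_a_cpu;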

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 00/19] Cache Aware Scheduling
  2025-10-14 21:48   ` Tim Chen
@ 2025-10-15  5:38     ` Chen, Yu C
  2025-10-15 18:26       ` Madadi Vineeth Reddy
  0 siblings, 1 reply; 116+ messages in thread
From: Chen, Yu C @ 2025-10-15  5:38 UTC (permalink / raw)
  To: Tim Chen, Madadi Vineeth Reddy
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Libo Chen, Adam Li,
	Tim Chen, linux-kernel, Yangyu Chen, haoxing990

On 10/15/2025 5:48 AM, Tim Chen wrote:
> On Tue, 2025-10-14 at 17:43 +0530, Madadi Vineeth Reddy wrote:
>> Hi Tim,
>> Thanks for the patch.
>>
>> On 11/10/25 23:54, Tim Chen wrote:

[snip]

>>> [Genoa details]
>>> [ChaCha20-xiangshan]
>>> ChaCha20-xiangshan is a simple benchmark using a static build of an
>>> 8-thread Verilator of XiangShan(RISC-V). The README file can be
>>> found here[2]. The score depends on how aggressive the user set the
>>> /sys/kernel/debug/sched/llc_aggr_tolerance. Using the default values,
>>> there is no much difference observed. While setting the
>>> /sys/kernel/debug/sched/llc_aggr_tolerance to 100, 44% improvment is
>>> observed.
>>>
>>> baseline:
>>> Host time spent: 50,868ms
>>>
>>> sched_cache:
>>> Host time spent: 28,349ms
>>>
>>> The time has been reduced by 44%.
>>
>> Milan showed no improvement across all benchmarks, which could be due to the
>> CCX topology (8 CCXs × 8 CPUs) where the LLC domain is too small for this
>> optimization to be effective. Moreover, there could be overhead due to the
>> additional computations.
>>
>> The ChaCha20-xiangshan improvement on Genoa when llc_aggr_tolerance is set to 100
>> seems to be due to its relatively lower thread count. Please provide the numbers
>> with the default values too. I would also like to see numbers under varying loads.
> 
> I'll ask Chen Yu who did the Xiangshan experiments if he has those numbers.
> 

Madadi, do you mean the performance score or the active thread count
when llc_aggr_tolerance is set to 1 (the default)?
With sched_cache and llc_aggr_tolerance set to 1, the score is about
the same as the baseline.
The active thread count is 128 per process, and there are 8 processes when
launching the benchmark. I suppose the 128 comes from the number
of online CPUs. Please let me know if you need more data.

Cced Yangyu who's the author of this benchmark.

ls -l /proc/14460/task/ | grep -c '^d'
128

>>
>> In Power 10 and Power 11, the LLC size is 4 threads, which is even smaller. I am
>> not expecting improvements here but will run some workloads and share the data.
>>
>> I have not gone through the entire series yet, but consider a two-node NUMA
>> system: if a task's preferred LLC is on the wrong NUMA node for its memory,
>> which takes precedence?
> 
> We take the preferred NUMA node into consideration, but we do not force tasks to
> go to the preferred node.
> 
> I remember that initially we limited the consideration to only LLCs in the
> preferred node. But we encountered regressions in hackbench and schbench:
> when the preferred node had no occupancy, the preferred LLC was set
> to -1 (no preference), which resulted in extra task migrations.
> Also, the preferred node for hackbench and schbench was volatile,
> as they have small memory footprints.  Chen Yu, please chime in if there
> were other reasons you remember.
> 

Since the preferred NUMA node is per task, while the preferred LLC
is per process, scanning only the current task's preferred node
would lead to cross-node migration. This is because the process's
preferred LLC may not reside within the current task's preferred
node. Such a scenario could leave curr_m_a_occ at 0, and any LLC
with an occupancy > 0 would then trigger a preferred LLC switch.

> We'll need to revisit this part of the code to take care of such
> corner cases. I think ideally we should move tasks to the least loaded LLC
> in the preferred node (even if no LLCs have occupancy in the preferred node),
> as long as the preferred NUMA node doesn't change too often.
> 
> 

Then we might need to introduce a new member in mm_struct to store the old
occupancy (curr_m_a_occ), so that we can reliably compare the old and new
occupancy and avoid comparing against a curr_m_a_occ of 0.
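
Something along these lines (just a sketch; mm_sched_occ is a made-up
field name):

	/*
	 * mm_sched_occ: hypothetical new mm_struct member recording the
	 * occupancy the preferred LLC had when it was last selected.
	 */
	if (!curr_m_a_occ)
		curr_m_a_occ = READ_ONCE(mm->mm_sched_occ);

	if (m_a_occ > 2 * curr_m_a_occ) {
		mm->mm_sched_cpu = m_a_cpu;
		WRITE_ONCE(mm->mm_sched_occ, m_a_occ);
	}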

>>
>> Also, what about workloads that don't share data, like stress-ng?
>>

stream is a single process stressing the memory without any shared
data; we did not observe any difference with stream. We can launch more
tests on stress-ng.

thanks,
Chenyu
>
> We can test those.  Ideally the controls to prevent over-aggregation to the preferred LLC
> would keep stress-ng happy.
> 
>> It will
>> be good to make sure that most other workloads don't suffer. As mentioned, a
>> per-process knob for llc_aggr_tolerance could help.
> 
> Agree. We are planning to add a per-process knob in the next version.  One thought is to use
> prctl. Any other suggestions are welcome.
> 
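
Just to illustrate the direction (purely hypothetical sketch; nothing is
implemented yet, and the prctl command name and value are invented):

	#include <sys/prctl.h>

	/* hypothetical per-process knob; the command value is a placeholder */
	#define PR_SET_LLC_AGGR_TOLERANCE	1000

	/* set the knob to 0 to opt this process out of cache-aware aggregation */
	prctl(PR_SET_LLC_AGGR_TOLERANCE, 0, 0, 0, 0);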


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 18/19] sched/fair: Avoid cache-aware scheduling for memory-heavy processes
  2025-10-11 18:24 ` [PATCH 18/19] sched/fair: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
@ 2025-10-15  6:57   ` kernel test robot
  2025-10-16  4:44     ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: kernel test robot @ 2025-10-15  6:57 UTC (permalink / raw)
  To: Tim Chen
  Cc: oe-lkp, lkp, Tim Chen, linux-kernel, aubrey.li, yu.c.chen,
	Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Libo Chen, Adam Li, Tim Chen, oliver.sang



Hello,

kernel test robot noticed "UBSAN:array-index-out-of-bounds_in_drivers/base/cacheinfo.c" on:

commit: e8b871200f11decae96692a3f5b385cdc25af231 ("[PATCH 18/19] sched/fair: Avoid cache-aware scheduling for memory-heavy processes")
url: https://github.com/intel-lab-lkp/linux/commits/Tim-Chen/sched-fair-Add-infrastructure-for-cache-aware-load-balancing/20251012-022248
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 45b7f780739a3145aeef24d2dfa02517a6c82ed6
patch link: https://lore.kernel.org/all/00da49fd590b95baad0525660bda4c0ba178243d.1760206683.git.tim.c.chen@linux.intel.com/
patch subject: [PATCH 18/19] sched/fair: Avoid cache-aware scheduling for memory-heavy processes

in testcase: boot

config: i386-randconfig-003-20251012
compiler: clang-20
test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202510151429.2c3f3413-lkp@intel.com


[   12.549731][   T83] ------------[ cut here ]------------
[   12.550388][   T83] UBSAN: array-index-out-of-bounds in drivers/base/cacheinfo.c:37:9
[   12.551060][   T83] index 4294967295 is out of range for type 'unsigned long[8]'
[   12.551580][   T83] CPU: 0 UID: 0 PID: 83 Comm: systemd-journal Not tainted 6.17.0-rc4-00035-ge8b871200f11 #1 PREEMPTLAZY
[   12.551585][   T83] Call Trace:
[   12.551588][   T83]  __dump_stack (lib/dump_stack.c:95)
[   12.551594][   T83]  dump_stack_lvl (lib/dump_stack.c:123)
[   12.551601][   T83]  ubsan_epilogue.llvm.16751680356772289369 (lib/dump_stack.c:129 lib/ubsan.c:233)
[   12.551607][   T83]  __ubsan_handle_out_of_bounds (lib/ubsan.c:?)
[   12.551621][   T83]  get_cpu_cacheinfo (drivers/base/cacheinfo.c:?)
[   12.551625][   T83]  exceed_llc_capacity (include/linux/cacheinfo.h:? kernel/sched/fair.c:1256)
[   12.551632][   T83]  task_cache_work.llvm.12119588225164800824 (kernel/sched/fair.c:1527)
[   12.551637][   T83]  ? task_work_run (kernel/task_work.c:?)
[   12.551641][   T83]  ? _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202)
[   12.551644][   T83]  ? __this_cpu_preempt_check (lib/smp_processor_id.c:65)
[   12.551648][   T83]  ? lockdep_hardirqs_on (kernel/locking/lockdep.c:4472)
[   12.551650][   T83]  ? _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202)
[   12.551652][   T83]  ? task_work_run (kernel/task_work.c:?)
[   12.551655][   T83]  ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:80)
[   12.551662][   T83]  task_work_run (kernel/task_work.c:229)
[   12.551668][   T83]  resume_user_mode_work (include/linux/resume_user_mode.h:?)
[   12.551673][   T83]  irqentry_exit_to_user_mode (kernel/entry/common.c:53 include/linux/irq-entry-common.h:225 kernel/entry/common.c:73)
[   12.551676][   T83]  ? sysvec_call_function_single (arch/x86/kernel/apic/apic.c:1050)
[   12.551681][   T83]  irqentry_exit (kernel/entry/common.c:210)
[   12.551684][   T83]  sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1050)
[   12.551689][   T83]  handle_exception (arch/x86/entry/entry_32.S:1055)
[   12.551691][   T83] EIP: 0x3764e8f0
[   12.551694][   T83] Code: 00 00 89 c2 eb d2 65 c7 05 28 02 00 00 ff ff ff ff 65 a1 08 00 00 00 f0 83 88 84 00 00 00 10 65 a1 80 00 00 00 e8 f0 02 00 00 <8b> 44 24 04 8b 54 24 08 89 10 8b 54 24 0c 89 50 04 65 8b 15 7c 00
All code
========
   0:	00 00                	add    %al,(%rax)
   2:	89 c2                	mov    %eax,%edx
   4:	eb d2                	jmp    0xffffffffffffffd8
   6:	65 c7 05 28 02 00 00 	movl   $0xffffffff,%gs:0x228(%rip)        # 0x239
   d:	ff ff ff ff 
  11:	65 a1 08 00 00 00 f0 	movabs %gs:0x848883f000000008,%eax
  18:	83 88 84 
  1b:	00 00                	add    %al,(%rax)
  1d:	00 10                	add    %dl,(%rax)
  1f:	65 a1 80 00 00 00 e8 	movabs %gs:0x2f0e800000080,%eax
  26:	f0 02 00 
  29:*	00 8b 44 24 04 8b    	add    %cl,-0x74fbdbbc(%rbx)		<-- trapping instruction
  2f:	54                   	push   %rsp
  30:	24 08                	and    $0x8,%al
  32:	89 10                	mov    %edx,(%rax)
  34:	8b 54 24 0c          	mov    0xc(%rsp),%edx
  38:	89 50 04             	mov    %edx,0x4(%rax)
  3b:	65                   	gs
  3c:	8b                   	.byte 0x8b
  3d:	15                   	.byte 0x15
  3e:	7c 00                	jl     0x40

Code starting with the faulting instruction
===========================================
   0:	8b 44 24 04          	mov    0x4(%rsp),%eax
   4:	8b 54 24 08          	mov    0x8(%rsp),%edx
   8:	89 10                	mov    %edx,(%rax)
   a:	8b 54 24 0c          	mov    0xc(%rsp),%edx
   e:	89 50 04             	mov    %edx,0x4(%rax)
  11:	65                   	gs
  12:	8b                   	.byte 0x8b
  13:	15                   	.byte 0x15
  14:	7c 00                	jl     0x16
[   12.551696][   T83] EAX: 3fdacc1c EBX: 36a49ba8 ECX: 36a49d64 EDX: 00000001
[   12.551697][   T83] ESI: 00000000 EDI: 36a49b40 EBP: 00000000 ESP: 3fdacbdc
[   12.551699][   T83] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200212
[   12.551708][   T83]  ? sysvec_call_function_single (arch/x86/kernel/apic/apic.c:1050)
[   12.551717][   T83] ---[ end trace ]---


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251015/202510151429.2c3f3413-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-11 18:24 ` [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing Tim Chen
@ 2025-10-15  7:23   ` kernel test robot
  2025-10-15 15:08   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 116+ messages in thread
From: kernel test robot @ 2025-10-15  7:23 UTC (permalink / raw)
  To: Tim Chen
  Cc: oe-lkp, lkp, Chen Yu, linux-kernel, aubrey.li, Peter Zijlstra,
	Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Tim Chen,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Libo Chen, Adam Li, Tim Chen, oliver.sang



Hello,

kernel test robot noticed "UBSAN:array-index-out-of-bounds_in_kernel/sched/fair.c" on:

commit: a9872e774986636a909d5cc6f6bde8b44add9f33 ("[PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing")
url: https://github.com/intel-lab-lkp/linux/commits/Tim-Chen/sched-fair-Add-infrastructure-for-cache-aware-load-balancing/20251012-022248
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 45b7f780739a3145aeef24d2dfa02517a6c82ed6
patch link: https://lore.kernel.org/all/ca1946de63ad9f0ae99e079a74d70c55879cc0b6.1760206683.git.tim.c.chen@linux.intel.com/
patch subject: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing

in testcase: boot

config: i386-randconfig-003-20251012
compiler: clang-20
test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202510151514.652b80d2-lkp@intel.com


[    0.527179][    T2] ------------[ cut here ]------------
[    0.527563][    T2] UBSAN: array-index-out-of-bounds in kernel/sched/fair.c:10965:6
[    0.527563][    T2] index -1 is out of range for type 'unsigned int[8]'
[    0.527563][    T2] CPU: 0 UID: 0 PID: 2 Comm: kthreadd Not tainted 6.17.0-rc4-00027-ga9872e774986 #1 PREEMPTLAZY
[    0.527563][    T2] Call Trace:
[    0.527563][    T2]  ? __dump_stack (lib/dump_stack.c:95)
[    0.527563][    T2]  ? dump_stack_lvl (lib/dump_stack.c:123)
[    0.527563][    T2]  ? ubsan_epilogue.llvm.14679606394994327670 (lib/dump_stack.c:129 lib/ubsan.c:233)
[    0.527563][    T2]  ? __ubsan_handle_out_of_bounds (lib/ubsan.c:?)
[    0.527563][    T2]  ? sched_balance_rq (kernel/sched/fair.c:?)
[    0.527563][    T2]  ? sched_balance_newidle (kernel/sched/fair.c:13514)
[    0.527563][    T2]  ? pick_next_task_fair (kernel/sched/fair.c:9328)
[    0.527563][    T2]  ? dequeue_task (kernel/sched/core.c:2136)
[    0.527563][    T2]  ? __pick_next_task (kernel/sched/core.c:5983)
[    0.527563][    T2]  ? __schedule (kernel/sched/core.c:?)
[    0.527563][    T2]  ? schedule (kernel/sched/core.c:7025 kernel/sched/core.c:7039)
[    0.527563][    T2]  ? kthreadd (kernel/kthread.c:835)
[    0.527563][    T2]  ? kthread_stop_put (kernel/kthread.c:818)
[    0.527563][    T2]  ? ret_from_fork (arch/x86/kernel/process.c:154)
[    0.527563][    T2]  ? kthread_stop_put (kernel/kthread.c:818)
[    0.527563][    T2]  ? debug_smp_processor_id (lib/smp_processor_id.c:58)
[    0.527563][    T2]  ? kthread_stop_put (kernel/kthread.c:818)
[    0.527563][    T2]  ? ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
[    0.527563][    T2]  ? entry_INT80_32 (arch/x86/entry/entry_32.S:945)
[    0.527563][    T2] ---[ end trace ]---


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251015/202510151514.652b80d2-lkp@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-10-11 18:24 ` [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions Tim Chen
@ 2025-10-15 10:15   ` Peter Zijlstra
  2025-10-15 16:27     ` Chen, Yu C
  2025-10-27  5:01   ` K Prateek Nayak
  1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 10:15 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:39AM -0700, Tim Chen wrote:
> +static void record_sg_llc_stats(struct lb_env *env,
> +				struct sg_lb_stats *sgs,
> +				struct sched_group *group)
> +{
> +	/*
> +	 * Find the child domain on env->dst_cpu. This domain
> +	 * is either the domain that spans this group(if the
> +	 * group is a local group), or the sibling domain of
> +	 * this group.
> +	 */
> +	struct sched_domain *sd = env->sd->child;
> +	struct sched_domain_shared *sd_share;
> +
> +	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
> +		return;
> +
> +	/* only care about sched domains spanning a LLC */
> +	if (sd != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
> +		return;
> +
> +	/*
> +	 * At this point we know this group spans a LLC domain.
> +	 * Record the statistic of this group in its corresponding
> +	 * shared LLC domain.
> +	 */
> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
> +					   cpumask_first(sched_group_span(group))));

Isn't this sd->shared ? Or did I loose the plot somewhere?

> +	if (!sd_share)
> +		return;
> +
> +	if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
> +		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
> +
> +	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
> +		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
> +}

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-11 18:24 ` [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs Tim Chen
@ 2025-10-15 11:04   ` Peter Zijlstra
  2025-10-15 16:25     ` Chen, Yu C
  2025-10-27  5:42   ` K Prateek Nayak
  1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 11:04 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Chen Yu,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:41AM -0700, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
> 
> Enable cache-aware load balancing only if at least 1 NUMA node has
> more than one LLC.
> 
> Suggested-by: Libo Chen <libo.chen@oracle.com>
> Suggested-by: Adam Li <adamli@os.amperecomputing.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c     | 15 ++++++++++++---
>  kernel/sched/sched.h    |  1 +
>  kernel/sched/topology.c | 14 ++++++++++++--
>  3 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cd080468ddc9..3d643449c48c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1208,6 +1208,14 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>  __read_mostly unsigned int llc_overload_pct       = 50;
>  __read_mostly unsigned int llc_imb_pct            = 20;
>  
> +DEFINE_STATIC_KEY_FALSE(sched_cache_allowed);
> +
> +static inline bool sched_cache_enabled(void)
> +{
> +	return sched_feat(SCHED_CACHE) &&
> +		static_branch_likely(&sched_cache_allowed);
> +}

Urgh; do we really need _2_ static keys stacked for this? I'm thinking
one should be well enough.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 05/19] sched/fair: Add LLC index mapping for CPUs
  2025-10-11 18:24 ` [PATCH 05/19] sched/fair: Add LLC index mapping for CPUs Tim Chen
@ 2025-10-15 11:08   ` Peter Zijlstra
  2025-10-15 11:58   ` Peter Zijlstra
  1 sibling, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 11:08 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:42AM -0700, Tim Chen wrote:
> Introduce an index mapping between CPUs and their LLCs. This provides

"Introduce a *dense* mapping ...", since we already have a mapping, but
as you explain below, that is sparse and not well suited for indexing.

> a continuous per LLC index needed for cache-aware load balancing in
> later patches.

> The maximum number of LLCs is limited by CONFIG_NR_LLCS. If the number
> of LLCs available exceeds CONFIG_NR_LLCS, the cache aware load balance
> is disabled. To further save memory, this array could be converted to
> dynamic allocation in the future, or the LLC index could be made NUMA
> node-wide.


> +config NR_LLCS
> +	int "Maximum number of Last Level Caches"
> +	range 2 1024
> +	depends on SMP && SCHED_CACHE
> +	default 64
> +	help
> +	  This allows you to specify the maximum number of last level caches
> +	  this kernel will support for cache aware scheduling.

Not really a fan of this max thing. I suppose I'll see the use in the
next few patches, but ideally we'd start with the dynamic solution.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes
  2025-10-14  5:16   ` Chen, Yu C
@ 2025-10-15 11:15     ` Peter Zijlstra
  2025-10-16  3:13       ` Chen, Yu C
  2025-10-17  4:50       ` Chen, Yu C
  0 siblings, 2 replies; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 11:15 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vern Hao, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Tue, Oct 14, 2025 at 01:16:16PM +0800, Chen, Yu C wrote:

> The question becomes: how can we figure out the threads that share
> data? Can the kernel detect this, or get the hint from user space?

This needs the PMU, then you can steer using cache-miss ratios. But then
people will hate us for using counters.

> Yes, the numa_group in NUMA load balancing indicates
> that several tasks manipulate the same page, which could be an
> indicator. Besides, if task A frequently wakes up task B, does it
> mean A and B have the potential to share data? Furthermore, if
> task A wakes up B via a pipe, it might also indicate that A has
> something to share with B. I just wonder if we can introduce a
> structure to gather this information together.

The wakeup or pipe relation might be small relative to the working set.
Consider a sharded in memory database, where the query comes in through
the pipe/socket/wakeup. This query is small, but then it needs to go
trawl through its memory to find the answer.

Something we *could* look at -- later -- is an interface to create
thread groups, such that userspace that is clever enough can communicate
this. But then there is the age-old question: will there be sufficient
users to justify the maintenance of said interface.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-14 19:12   ` Madadi Vineeth Reddy
  2025-10-15  4:54     ` Chen, Yu C
@ 2025-10-15 11:54     ` Peter Zijlstra
  2025-10-15 16:07       ` Chen, Yu C
  1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 11:54 UTC (permalink / raw)
  To: Madadi Vineeth Reddy
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu, Libo Chen,
	Adam Li, Tim Chen, linux-kernel

On Wed, Oct 15, 2025 at 12:42:48AM +0530, Madadi Vineeth Reddy wrote:
> > +static void get_scan_cpumasks(cpumask_var_t cpus, int cache_cpu,
> > +			      int pref_nid, int curr_cpu)
> > +{
> > +#ifdef CONFIG_NUMA_BALANCING
> > +	/* First honor the task's preferred node. */
> > +	if (pref_nid != NUMA_NO_NODE)
> > +		cpumask_or(cpus, cpus, cpumask_of_node(pref_nid));
> > +#endif
> > +
> > +	/* Next honor the task's cache CPU if it is not included. */
> > +	if (cache_cpu != -1 && !cpumask_test_cpu(cache_cpu, cpus))
> > +		cpumask_or(cpus, cpus,
> > +			   cpumask_of_node(cpu_to_node(cache_cpu)));
> > +
> > +	/*
> > +	 * Lastly make sure that the task's current running node is
> > +	 * considered.
> > +	 */
> > +	if (!cpumask_test_cpu(curr_cpu, cpus))
> > +		cpumask_or(cpus, cpus, cpumask_of_node(cpu_to_node(curr_cpu)));
> > +}
> > +
> > +static void __no_profile task_cache_work(struct callback_head *work)
> > +{
> > +	struct task_struct *p = current;
> > +	struct mm_struct *mm = p->mm;
> > +	unsigned long m_a_occ = 0;
> > +	unsigned long curr_m_a_occ = 0;
> > +	int cpu, m_a_cpu = -1, cache_cpu,
> > +	    pref_nid = NUMA_NO_NODE, curr_cpu;
> > +	cpumask_var_t cpus;
> > +
> > +	WARN_ON_ONCE(work != &p->cache_work);
> > +
> > +	work->next = work;
> > +
> > +	if (p->flags & PF_EXITING)
> > +		return;
> > +
> > +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
> > +		return;
> > +
> > +	curr_cpu = task_cpu(p);
> > +	cache_cpu = mm->mm_sched_cpu;
> > +#ifdef CONFIG_NUMA_BALANCING
> > +	if (static_branch_likely(&sched_numa_balancing))
> > +		pref_nid = p->numa_preferred_nid;
> > +#endif
> > +
> > +	scoped_guard (cpus_read_lock) {
> > +		get_scan_cpumasks(cpus, cache_cpu,
> > +				  pref_nid, curr_cpu);
> > +
> 
> IIUC, `get_scan_cpumasks` ORs together the preferred NUMA node, cache CPU's node,
> and current CPU's node. This could result in scanning multiple nodes, not preferring
> the NUMA preferred node.

So this used to be online_mask, and is now magically changed to this
more limited mask.

Could you split this change out and have it have a justification?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 05/19] sched/fair: Add LLC index mapping for CPUs
  2025-10-11 18:24 ` [PATCH 05/19] sched/fair: Add LLC index mapping for CPUs Tim Chen
  2025-10-15 11:08   ` Peter Zijlstra
@ 2025-10-15 11:58   ` Peter Zijlstra
  2025-10-15 20:12     ` Tim Chen
  1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 11:58 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:42AM -0700, Tim Chen wrote:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 2675db980f70..4bd033060f1d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -659,6 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_llc_idx);
>  DEFINE_PER_CPU(int, sd_share_id);
>  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);

There is literally *ONE* user of sd_llc_id, cpus_share_cache(), surely
that can equally use sd_llc_idx?

That is to say, do we really need two numbers for this?
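
IOW, something like the below (untested sketch):

	bool cpus_share_cache(int this_cpu, int that_cpu)
	{
		if (this_cpu == that_cpu)
			return true;

		return per_cpu(sd_llc_idx, this_cpu) == per_cpu(sd_llc_idx, that_cpu);
	}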

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-11 18:24 ` [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue Tim Chen
@ 2025-10-15 12:05   ` Peter Zijlstra
  2025-10-15 20:03     ` Tim Chen
  2025-10-27  6:04   ` K Prateek Nayak
  1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 12:05 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:44AM -0700, Tim Chen wrote:
> @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  		struct rq *rq = rq_of(cfs_rq);
>  
>  		account_numa_enqueue(rq, task_of(se));
> +		account_llc_enqueue(rq, task_of(se));
>  		list_add(&se->group_node, &rq->cfs_tasks);

Here and...

>  	}
>  	cfs_rq->nr_queued++;
> @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	update_load_sub(&cfs_rq->load, se->load.weight);
>  	if (entity_is_task(se)) {
>  		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));

... here, could you please check the compiler is doing CSE of task_of()?

>  		list_del_init(&se->group_node);
>  	}
>  	cfs_rq->nr_queued--;
> +
> +	/* safeguard to clear the cache aware data */
> +	if (!parent_entity(se) && !cfs_rq->nr_queued)
> +		reset_llc_stats(rq_of(cfs_rq));

I'm confused -- why?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter
  2025-10-11 18:24 ` [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter Tim Chen
@ 2025-10-15 12:21   ` Peter Zijlstra
  2025-10-15 20:41     ` Tim Chen
  0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 12:21 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:45AM -0700, Tim Chen wrote:
> Each runqueue is assigned a static array where each element tracks
> the number of tasks preferring a given LLC, indexed from 0 to
> NR_LLCS.
> 
> For example, rq->nr_pref_llc[3] = 2 signifies that there are 2 tasks on
> this runqueue which prefer to run within LLC3 (indexed from 0 to NR_LLCS
> 
> The load balancer can use this information to identify busy runqueues
> and migrate tasks to their preferred LLC domains.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c  | 35 +++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h |  1 +
>  2 files changed, 36 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fd315937c0cf..b7a68fe7601b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1235,22 +1235,51 @@ static inline int llc_idx(int cpu)
>  	return per_cpu(sd_llc_idx, cpu);
>  }
>  
> +static inline int pref_llc_idx(struct task_struct *p)
> +{
> +	return llc_idx(p->preferred_llc);
> +}
> +
>  static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
>  {
> +	int pref_llc;
> +
>  	if (!sched_cache_enabled())
>  		return;
>  
>  	rq->nr_llc_running += (p->preferred_llc != -1);
>  	rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));
> +
> +	if (p->preferred_llc < 0)
> +		return;
> +
> +	pref_llc = pref_llc_idx(p);
> +	if (pref_llc < 0)
> +		return;
> +
> +	++rq->nr_pref_llc[pref_llc];
>  }
>  
>  static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
>  {
> +	int pref_llc;
> +
>  	if (!sched_cache_enabled())
>  		return;
>  
>  	rq->nr_llc_running -= (p->preferred_llc != -1);
>  	rq->nr_pref_llc_running -= (p->preferred_llc == task_llc(p));
> +
> +	if (p->preferred_llc < 0)
> +		return;
> +
> +	pref_llc = pref_llc_idx(p);
> +	if (pref_llc < 0)
> +		return;
> +
> +	/* avoid negative counter */
> +	if (rq->nr_pref_llc[pref_llc] > 0)
> +		--rq->nr_pref_llc[pref_llc];

How!? Also, please use post increment/decrement operators.

>  }
>  
>  void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
> @@ -1524,10 +1553,16 @@ void init_sched_mm(struct task_struct *p)
>  
>  void reset_llc_stats(struct rq *rq)
>  {
> +	int i = 0;
> +
>  	if (!sched_cache_enabled())
>  		return;
>  
>  	rq->nr_llc_running = 0;
> +
> +	for (i = 0; i < max_llcs; ++i)
> +		rq->nr_pref_llc[i] = 0;
> +
>  	rq->nr_pref_llc_running = 0;
>  }

Still don't understand why this thing exists..

>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3ab64067acc6..b801d32d5fba 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1101,6 +1101,7 @@ struct rq {
>  #ifdef CONFIG_SCHED_CACHE
>  	unsigned int		nr_pref_llc_running;
>  	unsigned int		nr_llc_running;
> +	unsigned int		nr_pref_llc[NR_LLCS];

Gah, yeah, let's not do this. Just (re)alloc the thing on topology
changes or something.
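
Untested sketch of what I mean, with rq->nr_pref_llc turned into a
pointer and (re)allocated once max_llcs is known (locking against
concurrent updates ignored for brevity):

	static int alloc_nr_pref_llc(void)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			struct rq *rq = cpu_rq(cpu);
			unsigned int *p;

			p = kcalloc(max_llcs, sizeof(*p), GFP_KERNEL);
			if (!p)
				return -ENOMEM;

			kfree(rq->nr_pref_llc);
			rq->nr_pref_llc = p;
		}

		return 0;
	}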

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group
  2025-10-11 18:24 ` [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group Tim Chen
@ 2025-10-15 12:22   ` Peter Zijlstra
  2025-10-15 20:42     ` Tim Chen
  2025-10-15 12:25   ` Peter Zijlstra
  2025-10-27  8:33   ` K Prateek Nayak
  2 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 12:22 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:46AM -0700, Tim Chen wrote:
> During LLC load balancing, tabulate the number of tasks on each runqueue
> that prefer a given destination LLC in a sched group.
> 
> For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
> balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
> 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
> selected as the busiest source to pick tasks from.
> 
> Within a source LLC, the total number of tasks preferring a destination
> LLC is computed by summing counts across all CPUs in that runqueue. For
> instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
> LLC3, the total for LLC0 is 3.
> 
> These statistics allow the load balancer to choose tasks from source
> sched groups that best match their preferred LLCs.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b7a68fe7601b..cbd1e97bca4b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10399,6 +10399,9 @@ struct sg_lb_stats {
>  	unsigned int nr_numa_running;
>  	unsigned int nr_preferred_running;
>  #endif
> +#ifdef CONFIG_SCHED_CACHE
> +	unsigned int nr_pref_llc[NR_LLCS];
> +#endif
>  };

Hahahaha, no! We have this on-stack, this cannot be.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group
  2025-10-11 18:24 ` [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group Tim Chen
  2025-10-15 12:22   ` Peter Zijlstra
@ 2025-10-15 12:25   ` Peter Zijlstra
  2025-10-15 20:43     ` Tim Chen
  2025-10-27  8:33   ` K Prateek Nayak
  2 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 12:25 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:46AM -0700, Tim Chen wrote:

> +#ifdef CONFIG_SCHED_CACHE
> +		if (sched_cache_enabled()) {
> +			int j;
> +
> +			for (j = 0; j < max_llcs; ++j)
> +				sgs->nr_pref_llc[j] += rq->nr_pref_llc[j];

We live in the year 2025 and have embraced C99; please write this as:

	for (int j = 0; j < max_llcs; j++)

> +		}
> +#endif

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-11 18:24 ` [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing Tim Chen
  2025-10-15  7:23   ` kernel test robot
@ 2025-10-15 15:08   ` Peter Zijlstra
  2025-10-15 21:28     ` Tim Chen
  2025-10-15 15:10   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 15:08 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:47AM -0700, Tim Chen wrote:
> During LLC load balancing, first check for tasks that prefer the
> destination LLC and balance them to it before others.
> 
> Mark source sched groups containing tasks preferring non local LLCs
> with the group_llc_balance flag. This ensures the load balancer later
> pulls or pushes these tasks toward their preferred LLCs.
> 
> Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---

For me this patch is cut too fine; it only sets group_llc_balance but
then we don't see how it is used.

>  kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 41 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cbd1e97bca4b..af7b578eaa06 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9822,8 +9822,7 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
>  	else
>  		return mig_unrestricted;
>  
> -	return can_migrate_llc(src_cpu, dst_cpu,
> -			       task_util(p), to_pref);
> +	return can_migrate_llc(src_cpu, dst_cpu, task_util(p), to_pref);
>  }
>  
>  #else
> @@ -10394,6 +10393,7 @@ struct sg_lb_stats {
>  	enum group_type group_type;
>  	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
>  	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
> +	unsigned int group_llc_balance;		/* Tasks should be moved to preferred LLC */
>  	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
>  #ifdef CONFIG_NUMA_BALANCING
>  	unsigned int nr_numa_running;
> @@ -10849,11 +10849,45 @@ static void record_sg_llc_stats(struct lb_env *env,
>  	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
>  		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
>  }
> +
> +/*
> + * Do LLC balance on sched group that contains LLC, and have tasks preferring
> + * to run on LLC in idle dst_cpu.
> + */
> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> +			       struct sched_group *group)
> +{
> +	struct sched_domain *child = env->sd->child;
> +	int llc;
> +
> +	if (!sched_cache_enabled())
> +		return false;
> +
> +	if (env->sd->flags & SD_SHARE_LLC)
> +		return false;
> +
> +	/* only care about task migration among LLCs */
> +	if (child && !(child->flags & SD_SHARE_LLC))
> +		return false;
> +
> +	llc = llc_idx(env->dst_cpu);
> +	if (sgs->nr_pref_llc[llc] > 0 &&

Nit: s/> 0// would be the same, right?

> +	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
> +		return true;
> +
> +	return false;
> +}
>  #else
>  static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
>  				       struct sched_group *group)
>  {
>  }
> +
> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> +			       struct sched_group *group)
> +{
> +	return false;
> +}
>  #endif
>  
>  /**
> @@ -10954,6 +10988,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>  
>  	record_sg_llc_stats(env, sgs, group);
> +
> +	/* Check for tasks in this group can be moved to their preferred LLC */
> +	if (!local_group && llc_balance(env, sgs, group))
> +		sgs->group_llc_balance = 1;

We now have 3 (or so) branches that start with:

	if (!local_group &&

perhaps collate that some?

> +
>  	/* Computing avg_load makes sense only when group is overloaded */
>  	if (sgs->group_type == group_overloaded)
>  		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
> -- 
> 2.32.0
> 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-11 18:24 ` [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing Tim Chen
  2025-10-15  7:23   ` kernel test robot
  2025-10-15 15:08   ` Peter Zijlstra
@ 2025-10-15 15:10   ` Peter Zijlstra
  2025-10-15 16:03     ` Chen, Yu C
  2025-10-24  9:32   ` Aaron Lu
  2025-10-27  6:29   ` K Prateek Nayak
  4 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 15:10 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:47AM -0700, Tim Chen wrote:

> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> +			       struct sched_group *group)
> +{
> +	struct sched_domain *child = env->sd->child;
> +	int llc;
> +
> +	if (!sched_cache_enabled())
> +		return false;
> +
> +	if (env->sd->flags & SD_SHARE_LLC)
> +		return false;
> +
> +	/* only care about task migration among LLCs */
> +	if (child && !(child->flags & SD_SHARE_LLC))
> +		return false;
> +
> +	llc = llc_idx(env->dst_cpu);
> +	if (sgs->nr_pref_llc[llc] > 0 &&

Robot says llc can be -1 here, and it doesn't like doing out-of-bounds
array access.

> +	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
> +		return true;
> +
> +	return false;
> +}

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 11/19] sched/fair: Identify busiest sched_group for LLC-aware load balancing
  2025-10-11 18:24 ` [PATCH 11/19] sched/fair: Identify busiest sched_group for LLC-aware load balancing Tim Chen
@ 2025-10-15 15:24   ` Peter Zijlstra
  2025-10-15 21:18     ` Tim Chen
  0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-15 15:24 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:48AM -0700, Tim Chen wrote:

> @@ -11035,6 +11059,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	     sds->local_stat.group_type != group_has_spare))
>  		return false;
>  
> +	/* deal with prefer LLC load balance, if failed, fall into normal load balance */
> +	if (update_llc_busiest(env, busiest, sgs))
> +		return true;
> +
> +	/*
> +	 * If the busiest group has tasks with LLC preference,
> +	 * skip normal load balance.
> +	 */
> +	if (busiest->group_llc_balance)
> +		return false;
> +
>  	if (sgs->group_type > busiest->group_type)
>  		return true;

This feels weird.. should we really override things like group_imbalance
or group_misfit_task ?


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-15 15:10   ` Peter Zijlstra
@ 2025-10-15 16:03     ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-15 16:03 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Libo Chen, Adam Li, Tim Chen, linux-kernel

On 10/15/2025 11:10 PM, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:47AM -0700, Tim Chen wrote:
> 
>> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
>> +			       struct sched_group *group)
>> +{
>> +	struct sched_domain *child = env->sd->child;
>> +	int llc;
>> +
>> +	if (!sched_cache_enabled())
>> +		return false;
>> +
>> +	if (env->sd->flags & SD_SHARE_LLC)
>> +		return false;
>> +
>> +	/* only care about task migration among LLCs */
>> +	if (child && !(child->flags & SD_SHARE_LLC))
>> +		return false;
>> +
>> +	llc = llc_idx(env->dst_cpu);
>> +	if (sgs->nr_pref_llc[llc] > 0 &&
> 
> Robot says llc can be -1 here, and it don't like doing out of bound
> array access.
> 

Hmm, there seems to be a race condition during bootup with
build_sched_domains(), where per_cpu(sd_llc_idx) is reset to -1 at the
beginning. We might need to add a sanity check here.
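
That is, something like this in llc_balance() (untested):

	llc = llc_idx(env->dst_cpu);
	if (llc < 0)
		return false;

	if (sgs->nr_pref_llc[llc] &&
	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
		return true;

	return false;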


thanks,
Chenyu
>> +	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
>> +		return true;
>> +
>> +	return false;
>> +}

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-15 11:54     ` Peter Zijlstra
@ 2025-10-15 16:07       ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-15 16:07 UTC (permalink / raw)
  To: Peter Zijlstra, Madadi Vineeth Reddy
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel

On 10/15/2025 7:54 PM, Peter Zijlstra wrote:
> On Wed, Oct 15, 2025 at 12:42:48AM +0530, Madadi Vineeth Reddy wrote:
>>> +static void get_scan_cpumasks(cpumask_var_t cpus, int cache_cpu,
>>> +			      int pref_nid, int curr_cpu)
>>> +{
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +	/* First honor the task's preferred node. */
>>> +	if (pref_nid != NUMA_NO_NODE)
>>> +		cpumask_or(cpus, cpus, cpumask_of_node(pref_nid));
>>> +#endif
>>> +
>>> +	/* Next honor the task's cache CPU if it is not included. */
>>> +	if (cache_cpu != -1 && !cpumask_test_cpu(cache_cpu, cpus))
>>> +		cpumask_or(cpus, cpus,
>>> +			   cpumask_of_node(cpu_to_node(cache_cpu)));
>>> +
>>> +	/*
>>> +	 * Lastly make sure that the task's current running node is
>>> +	 * considered.
>>> +	 */
>>> +	if (!cpumask_test_cpu(curr_cpu, cpus))
>>> +		cpumask_or(cpus, cpus, cpumask_of_node(cpu_to_node(curr_cpu)));
>>> +}
>>> +
>>> +static void __no_profile task_cache_work(struct callback_head *work)
>>> +{
>>> +	struct task_struct *p = current;
>>> +	struct mm_struct *mm = p->mm;
>>> +	unsigned long m_a_occ = 0;
>>> +	unsigned long curr_m_a_occ = 0;
>>> +	int cpu, m_a_cpu = -1, cache_cpu,
>>> +	    pref_nid = NUMA_NO_NODE, curr_cpu;
>>> +	cpumask_var_t cpus;
>>> +
>>> +	WARN_ON_ONCE(work != &p->cache_work);
>>> +
>>> +	work->next = work;
>>> +
>>> +	if (p->flags & PF_EXITING)
>>> +		return;
>>> +
>>> +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
>>> +		return;
>>> +
>>> +	curr_cpu = task_cpu(p);
>>> +	cache_cpu = mm->mm_sched_cpu;
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +	if (static_branch_likely(&sched_numa_balancing))
>>> +		pref_nid = p->numa_preferred_nid;
>>> +#endif
>>> +
>>> +	scoped_guard (cpus_read_lock) {
>>> +		get_scan_cpumasks(cpus, cache_cpu,
>>> +				  pref_nid, curr_cpu);
>>> +
>>
>> IIUC, `get_scan_cpumasks` ORs together the preferred NUMA node, cache CPU's node,
>> and current CPU's node. This could result in scanning multiple nodes, not preferring
>> the NUMA preferred node.
> 
> So this used to be online_mask, and is now magically changed to this
> more limited mask.
> 
> Could you split this change out and have it have a justification?

OK, we will do this and provide an explanation.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-15 11:04   ` Peter Zijlstra
@ 2025-10-15 16:25     ` Chen, Yu C
  2025-10-15 16:36       ` Shrikanth Hegde
  2025-10-16  7:40       ` Peter Zijlstra
  0 siblings, 2 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-15 16:25 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel

On 10/15/2025 7:04 PM, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:41AM -0700, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> Enable cache-aware load balancing only if at least 1 NUMA node has
>> more than one LLC.
>>
>> Suggested-by: Libo Chen <libo.chen@oracle.com>
>> Suggested-by: Adam Li <adamli@os.amperecomputing.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>   kernel/sched/fair.c     | 15 ++++++++++++---
>>   kernel/sched/sched.h    |  1 +
>>   kernel/sched/topology.c | 14 ++++++++++++--
>>   3 files changed, 25 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index cd080468ddc9..3d643449c48c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1208,6 +1208,14 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>>   __read_mostly unsigned int llc_overload_pct       = 50;
>>   __read_mostly unsigned int llc_imb_pct            = 20;
>>   
>> +DEFINE_STATIC_KEY_FALSE(sched_cache_allowed);
>> +
>> +static inline bool sched_cache_enabled(void)
>> +{
>> +	return sched_feat(SCHED_CACHE) &&
>> +		static_branch_likely(&sched_cache_allowed);
>> +}
> 
> Urgh; do we really need _2_ static keys stacked for this? I'm thinking
> one should be well enough.

SCHED_CACHE allows user space to turn the feature on/off at runtime,
while sched_cache_allowed reflects a hardware capability: it is
disabled if no NUMA node has more than one LLC. I'm not sure
whether a single key could support both scenarios.
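
For reference, the hardware side is only flipped once when the sched
domains are built, roughly like this (sketch; has_multi_llc_node is a
made-up name for the topology check):

	/* in topology code, after the LLC domains are known */
	if (has_multi_llc_node)
		static_branch_enable(&sched_cache_allowed);
	else
		static_branch_disable(&sched_cache_allowed);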

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-10-15 10:15   ` Peter Zijlstra
@ 2025-10-15 16:27     ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-15 16:27 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel

On 10/15/2025 6:15 PM, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:39AM -0700, Tim Chen wrote:
>> +static void record_sg_llc_stats(struct lb_env *env,
>> +				struct sg_lb_stats *sgs,
>> +				struct sched_group *group)
>> +{
>> +	/*
>> +	 * Find the child domain on env->dst_cpu. This domain
>> +	 * is either the domain that spans this group(if the
>> +	 * group is a local group), or the sibling domain of
>> +	 * this group.
>> +	 */
>> +	struct sched_domain *sd = env->sd->child;
>> +	struct sched_domain_shared *sd_share;
>> +
>> +	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
>> +		return;
>> +
>> +	/* only care about sched domains spanning a LLC */
>> +	if (sd != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
>> +		return;
>> +
>> +	/*
>> +	 * At this point we know this group spans a LLC domain.
>> +	 * Record the statistic of this group in its corresponding
>> +	 * shared LLC domain.
>> +	 */
>> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
>> +					   cpumask_first(sched_group_span(group))));
> 
> Isn't this sd->shared ? Or did I loose the plot somewhere?
>

The sd here refers to the domain that covers the local_group,
which is derived from env->dst_cpu. Meanwhile, sd_share corresponds
to the domain covering the 'group' that may be a sibling of the local_group.
Our goal is to update the statistics of this latter 'group'. It is assumed
that the local_group and its sibling 'group' have the same CPU weight,
meaning they each cover one LLC. We check sd simply because it is the
only domain we can obtain via env->sd->child (apologies, but I haven’t
found a way to get the corresponding domain that spans the 'group' itself).

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-15 16:25     ` Chen, Yu C
@ 2025-10-15 16:36       ` Shrikanth Hegde
  2025-10-15 17:01         ` Chen, Yu C
  2025-10-16  7:40       ` Peter Zijlstra
  1 sibling, 1 reply; 116+ messages in thread
From: Shrikanth Hegde @ 2025-10-15 16:36 UTC (permalink / raw)
  To: Chen, Yu C, Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel



On 10/15/25 9:55 PM, Chen, Yu C wrote:
> On 10/15/2025 7:04 PM, Peter Zijlstra wrote:
>> On Sat, Oct 11, 2025 at 11:24:41AM -0700, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@intel.com>
>>>
>>> Enable cache-aware load balancing only if at least 1 NUMA node has
>>> more than one LLC.
>>>
>>> Suggested-by: Libo Chen <libo.chen@oracle.com>
>>> Suggested-by: Adam Li <adamli@os.amperecomputing.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> ---
>>>   kernel/sched/fair.c     | 15 ++++++++++++---
>>>   kernel/sched/sched.h    |  1 +
>>>   kernel/sched/topology.c | 14 ++++++++++++--
>>>   3 files changed, 25 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index cd080468ddc9..3d643449c48c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1208,6 +1208,14 @@ static s64 update_se(struct rq *rq, struct 
>>> sched_entity *se)
>>>   __read_mostly unsigned int llc_overload_pct       = 50;
>>>   __read_mostly unsigned int llc_imb_pct            = 20;
>>> +DEFINE_STATIC_KEY_FALSE(sched_cache_allowed);
>>> +
>>> +static inline bool sched_cache_enabled(void)
>>> +{
>>> +    return sched_feat(SCHED_CACHE) &&
>>> +        static_branch_likely(&sched_cache_allowed);
>>> +}
>>
>> Urgh; do we really need _2_ static keys stacked for this? I'm thinking
>> one should be well enough.
> 
> SCHED_CACHE allows user space to turn on/off the feature at runtime,
> while sched_cache_allow is a hardware capability. This capability is

isn't it possible to use only static_branch_likely(&sched_cache_allowed) at runtime?

Enable that key only if FEAT is set. Disable when unset.
That way you could use only one static branch at runtime.

Also, I am not sure the FEATURE should be true by default. I know it may be unused, but
IMO it should be true by default only once it's proven there are no regressions.
One should be aware of their topology to enable it.
>   disabled if there are no multiple LLCs within one node. I’m not sure
> if using one key could support the above two scenarios.

It is possible to have multiple NUMA nodes. One node may have multiple LLCs while another
may have only 1 LLC. What happens in that case?

I am yet to go through the series (hopefully this week). Maybe it's handled already.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-15 16:36       ` Shrikanth Hegde
@ 2025-10-15 17:01         ` Chen, Yu C
  2025-10-16  7:42           ` Peter Zijlstra
  0 siblings, 1 reply; 116+ messages in thread
From: Chen, Yu C @ 2025-10-15 17:01 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel, Peter Zijlstra, Tim Chen

On 10/16/2025 12:36 AM, Shrikanth Hegde wrote:
> 
> 
> On 10/15/25 9:55 PM, Chen, Yu C wrote:
>> On 10/15/2025 7:04 PM, Peter Zijlstra wrote:
>>> On Sat, Oct 11, 2025 at 11:24:41AM -0700, Tim Chen wrote:
>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>
>>>> Enable cache-aware load balancing only if at least 1 NUMA node has
>>>> more than one LLC.
>>>>
>>>> Suggested-by: Libo Chen <libo.chen@oracle.com>
>>>> Suggested-by: Adam Li <adamli@os.amperecomputing.com>
>>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>> ---
>>>>   kernel/sched/fair.c     | 15 ++++++++++++---
>>>>   kernel/sched/sched.h    |  1 +
>>>>   kernel/sched/topology.c | 14 ++++++++++++--
>>>>   3 files changed, 25 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index cd080468ddc9..3d643449c48c 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -1208,6 +1208,14 @@ static s64 update_se(struct rq *rq, struct 
>>>> sched_entity *se)
>>>>   __read_mostly unsigned int llc_overload_pct       = 50;
>>>>   __read_mostly unsigned int llc_imb_pct            = 20;
>>>> +DEFINE_STATIC_KEY_FALSE(sched_cache_allowed);
>>>> +
>>>> +static inline bool sched_cache_enabled(void)
>>>> +{
>>>> +    return sched_feat(SCHED_CACHE) &&
>>>> +        static_branch_likely(&sched_cache_allowed);
>>>> +}
>>>
>>> Urgh; do we really need _2_ static keys stacked for this? I'm thinking
>>> one should be well enough.
>>
>> SCHED_CACHE allows user space to turn on/off the feature at runtime,
>> while sched_cache_allow is a hardware capability. This capability is
> 
> isn't it possible to use only static_branch_likely(&sched_cache_allowed) at
> runtime?
> 
> Enable that key only if FEAT is set. Disable when unset.
> That way you could use only one static branch at runtime.
> 

Oh, do you mean only using sched_cache_allowed in sched_cache_enabled()?
I misunderstood that Peter suggested introducing only one key. But I didn't
quite catch up - do you mean we should monitor the switch of FEAT, modify
sched_cache_allowed when needed, and that the OS only queries
sched_cache_allowed at runtime? I'll take a deeper look tomorrow.

> Also, I am not sure the FEATURE should be true by default. I know it
> may be unused, but
> IMO it should be true by default only once it's proven there are no
> regressions.

Yes, we have tried very hard in the past not to introduce regressions.

> One should be aware of their topology to enable it.
>>   disabled if there are no multiple LLCs within one node. I’m not sure
>> if using one key could support the above two scenarios.
>> It is possible to have multiple NUMA nodes. One node may have multiple 
>> LLC while other
> one may have only 1 LLC. what happens in that case?
> 

In this case, it will be enabled, and the cache-aware load balancing
will occur on that node with multiple LLCs. (Only a domain with
SD_SHARE_LLC set, and whose parent domain does not have SD_SHARE_LLC
set, will initiate the cache-aware load balancing.)
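
For illustration, the check described above is roughly of this shape (a
sketch for discussion, not a quote of the series):

/*
 * Illustrative only: cache-aware balancing is initiated from the highest
 * domain that still shares an LLC, i.e. SD_SHARE_LLC is set on the domain
 * itself but not on its parent.
 */
static bool llc_balance_level(struct sched_domain *sd)
{
        if (!(sd->flags & SD_SHARE_LLC))
                return false;

        return !sd->parent || !(sd->parent->flags & SD_SHARE_LLC);
}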


thanks,
Chenyu

> I am yet to go through the series(hopefully this week). Maybe its 
> handled already.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 00/19] Cache Aware Scheduling
  2025-10-15  5:38     ` Chen, Yu C
@ 2025-10-15 18:26       ` Madadi Vineeth Reddy
  2025-10-16  4:57         ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: Madadi Vineeth Reddy @ 2025-10-15 18:26 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Libo Chen, Adam Li, Tim Chen, linux-kernel, haoxing990,
	Madadi Vineeth Reddy

On 15/10/25 11:08, Chen, Yu C wrote:
> On 10/15/2025 5:48 AM, Tim Chen wrote:
>> On Tue, 2025-10-14 at 17:43 +0530, Madadi Vineeth Reddy wrote:
>>> Hi Tim,
>>> Thanks for the patch.
>>>
>>> On 11/10/25 23:54, Tim Chen wrote:
> 
> [snip]
> 
>>>> [Genoa details]
>>>> [ChaCha20-xiangshan]
>>>> ChaCha20-xiangshan is a simple benchmark using a static build of an
>>>> 8-thread Verilator of XiangShan(RISC-V). The README file can be
>>>> found here[2]. The score depends on how aggressively the user sets
>>>> /sys/kernel/debug/sched/llc_aggr_tolerance. Using the default values,
>>>> there is not much difference observed. When setting
>>>> /sys/kernel/debug/sched/llc_aggr_tolerance to 100, a 44% improvement is
>>>> observed.
>>>>
>>>> baseline:
>>>> Host time spent: 50,868ms
>>>>
>>>> sched_cache:
>>>> Host time spent: 28,349ms
>>>>
>>>> The time has been reduced by 44%.
>>>
>>> Milan showed no improvement across all benchmarks, which could be due to the
>>> CCX topology (8 CCXs × 8 CPUs) where the LLC domain is too small for this
>>> optimization to be effective. Moreover there could be overhead due to additional
>>> computations.
>>>
>>> ChaCha20-xiangshan improvement in Genoa when llc_aggr_tolerance is set to 100 seems
>>> due to having relatively lesser thread count. Please provide the numbers
>>> with default values too. Would like to know numbers on varying loads.
>>
>> I'll ask Chen Yu who did the Xiangshan experiments if he has those numbers.
>>
> 
> Madadi, do you mean the performance score or the active thread count
> when llc_aggr_tolerance is set to 1 (default)?
> The score is about the same as the baseline with sched_cache and
> llc_aggr_tolerance set to 1.
> The active thread count is 128 per process, and there are 8 processes when
> launching the benchmark. I suppose the 128 comes from the number
> of online CPUs. Please let me know if you need more data.
> 
> Cced Yangyu who's the author of this benchmark.

I mean the benchmark result with default value of llc_aggr_tolerance on Genoa
in comparison to baseline. Knowing number of threads also helps to understand
the impact. 

> 
> ls -l /proc/14460/task/ | grep -c '^d'
> 128
> 
>>>
>>> In Power 10 and Power 11, the LLC size is 4 threads which is even smaller. Not
>>> expecting improvements here but will run some workloads and share the data.
>>>
>>> I have not gone through the entire series yet, but in situations like, say, a two-node
>>> NUMA system, if a task's preferred LLC is on the wrong NUMA node for its memory,
>>> which takes precedence?
>>
>> We take preferred NUMA node in the consideration but we do not force task to
>> go to the preferred node.
>>
>> I remember that initially we limited the consideration to only LLCs in the
>> preferred node. But we encountered regressions in hackbench and schbench:
>> when the preferred node had no occupancy, the preferred LLC
>> was set to -1 (no preference), which resulted in extra task migrations.
>> Also, the preferred node for hackbench and schbench was volatile,
>> as they have a small memory footprint.  Chen Yu, please chime in if there
>> were other reasons you remember.
>>
> 
> Since the preferred NUMA node is per task, while the preferred LLC
> is per process, scanning only the current task's preferred node
> would lead to cross-node migration. This is because the process's
> preferred LLC may not reside within the current task's preferred
> node. Such a scenario could leave curr_m_a_occ at 0, and any LLC
> with an occupancy > 0 would then trigger a preferred LLC switch.

Understood. Thanks for the context.

> 
>> We'll need to revisit this part of the code to take care of such
>> corner cases. I think ideally we should move tasks to the least loaded LLC
>> in the preferred node (even if no LLCs have occupancy in the preferred node),
>> as long as the preferred NUMA node doesn't change too often.
>>
>>
> 
> Then we might need to introduce a new member in mm_struct to store the old
> occupancy, curr_m_a_occ, so that we can reliably compare the old and new
> occupancy - to avoid the 0 value of curr_m_a_occ.
> 
>>>
>>> Also, what about the workloads that don't share data like stress-ng?
>>>
> 
> The stream benchmark is a single process stressing memory without any shared
> data; we did not observe any difference on stream. We can launch more
> tests on stress-ng.
> 

That would be helpful.

Thanks,
Madadi Vineeth Reddy

> thanks,
> Chenyu
>
>> We can test those.  Ideally the controls to prevent over aggregation to preferred LLC
>> would keep stress-ng happy.
>>
>>> It will
>>> be good to make sure that most other workloads don't suffer. As mentioned,
>>> per process knob for llc_aggr_tolerance could help.
>>
>> Agree. We are planning to add per process knob for the next version.  One thought is to use
>> prctl. Any other suggestions are welcome.
>>
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-15  4:54     ` Chen, Yu C
@ 2025-10-15 19:32       ` Tim Chen
  2025-10-16  3:11         ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-15 19:32 UTC (permalink / raw)
  To: Chen, Yu C, Madadi Vineeth Reddy
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel, haoxing990

On Wed, 2025-10-15 at 12:54 +0800, Chen, Yu C wrote:
> On 10/15/2025 3:12 AM, Madadi Vineeth Reddy wrote:
> > On 11/10/25 23:54, Tim Chen wrote:
> > > From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
> > > 
> > > Cache-aware load balancing aims to aggregate tasks with potential
> > > shared resources into the same cache domain. This approach enhances
> > > cache locality, thereby optimizing system performance by reducing
> > > cache misses and improving data access efficiency.
> > > 
> 
> [snip]
> 
> > > +static void __no_profile task_cache_work(struct callback_head *work)
> > > +{
> > > +	struct task_struct *p = current;
> > > +	struct mm_struct *mm = p->mm;
> > > +	unsigned long m_a_occ = 0;
> > > +	unsigned long curr_m_a_occ = 0;
> > > +	int cpu, m_a_cpu = -1, cache_cpu,
> > > +	    pref_nid = NUMA_NO_NODE, curr_cpu;
> > > +	cpumask_var_t cpus;
> > > +
> > > +	WARN_ON_ONCE(work != &p->cache_work);
> > > +
> > > +	work->next = work;
> > > +
> > > +	if (p->flags & PF_EXITING)
> > > +		return;
> > > +
> > > +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
> > > +		return;
> > > +
> > > +	curr_cpu = task_cpu(p);
> > > +	cache_cpu = mm->mm_sched_cpu;
> > > +#ifdef CONFIG_NUMA_BALANCING
> > > +	if (static_branch_likely(&sched_numa_balancing))
> > > +		pref_nid = p->numa_preferred_nid;
> > > +#endif
> > > +
> > > +	scoped_guard (cpus_read_lock) {
> > > +		get_scan_cpumasks(cpus, cache_cpu,
> > > +				  pref_nid, curr_cpu);
> > > +
> > 
> > IIUC, `get_scan_cpumasks` ORs together the preferred NUMA node, cache CPU's node,
> > and current CPU's node. This could result in scanning multiple nodes, not preferring
> > the NUMA preferred node.
> > 
> 
> Yes, it is possible, please see comments below.
> 
> > > +		for_each_cpu(cpu, cpus) {
> > > +			/* XXX sched_cluster_active */
> > > +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
> > > +			unsigned long occ, m_occ = 0, a_occ = 0;
> > > +			int m_cpu = -1, i;
> > > +
> > > +			if (!sd)
> > > +				continue;
> > > +
> > > +			for_each_cpu(i, sched_domain_span(sd)) {
> > > +				occ = fraction_mm_sched(cpu_rq(i),
> > > +							per_cpu_ptr(mm->pcpu_sched, i));
> > > +				a_occ += occ;
> > > +				if (occ > m_occ) {
> > > +					m_occ = occ;
> > > +					m_cpu = i;
> > > +				}
> > > +			}
> > > +
> > > +			/*
> > > +			 * Compare the accumulated occupancy of each LLC. The
> > > +			 * reason for using accumulated occupancy rather than average
> > > +			 * per CPU occupancy is that it works better in asymmetric LLC
> > > +			 * scenarios.
> > > +			 * For example, if there are 2 threads in a 4CPU LLC and 3
> > > +			 * threads in an 8CPU LLC, it might be better to choose the one
> > > +			 * with 3 threads. However, this would not be the case if the
> > > +			 * occupancy is divided by the number of CPUs in an LLC (i.e.,
> > > +			 * if average per CPU occupancy is used).
> > > +			 * Besides, NUMA balancing fault statistics behave similarly:
> > > +			 * the total number of faults per node is compared rather than
> > > +			 * the average number of faults per CPU. This strategy is also
> > > +			 * followed here.
> > > +			 */
> > > +			if (a_occ > m_a_occ) {
> > > +				m_a_occ = a_occ;
> > > +				m_a_cpu = m_cpu;
> > > +			}
> > > +
> > > +			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
> > > +				curr_m_a_occ = a_occ;
> > > +
> > > +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
> > > +		}
> > 
> > This means NUMA preference has no effect on the selection, except in the
> > unlikely case of exactly equal occupancy across LLCs on different nodes
> > (where iteration order determines the winner).
> > 
> > How does it handle when cache locality and memory locality conflict?
> > Shouldn't numa preferred node get preference? Also scanning multiple
> > nodes add overhead, so can restricting it to numa preferred node be
> > better and scan others only when there is no numa preferred node?
> > 
> 
> Basically, yes, you're right. Ideally, we should prioritize the NUMA
> preferred node as the top priority. There's one case I find hard to
> handle: the NUMA preferred node is per task rather than per process.
> It's possible that different threads of the same process have different
> preferred nodes; as a result, the process-wide preferred LLC could bounce
> between different nodes, which might cause costly task migrations across
> nodes. As a workaround, we tried to keep the scan CPU mask covering the
> process's current preferred LLC to ensure the old preferred LLC is included
> in the candidates. After all, we have a 2X threshold for switching the
> preferred LLC.

If tasks in a process had different preferred nodes, they would
belong to different numa_groups, and majority of their data would
be from different NUMA nodes.

To resolve such conflict, we'll need to change the aggregation of tasks by
process, to aggregation of tasks by numa_group when NUMA balancing is
enabled.  This probably makes more sense as tasks in a numa_group
have more shared data and would benefit from co-locating in the
same cache.
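
A very rough sketch of that direction (purely hypothetical; the helper
name is made up, p->numa_group is the existing NUMA-balancing structure):

/*
 * Hypothetical: pick the aggregation key for cache-aware scheduling.
 * Tasks sharing a numa_group are known to touch the same pages, so
 * prefer that grouping over the whole process when it exists.
 * Caller is assumed to hold rcu_read_lock().
 */
static inline void *cache_aggr_key(struct task_struct *p)
{
        struct numa_group *ng = rcu_dereference(p->numa_group);

        return ng ? (void *)ng : (void *)p->mm;
}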

Thanks.

Tim



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-15 12:05   ` Peter Zijlstra
@ 2025-10-15 20:03     ` Tim Chen
  2025-10-16  7:44       ` Peter Zijlstra
  0 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-15 20:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Wed, 2025-10-15 at 14:05 +0200, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:44AM -0700, Tim Chen wrote:
> > @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  		struct rq *rq = rq_of(cfs_rq);
> >  
> >  		account_numa_enqueue(rq, task_of(se));
> > +		account_llc_enqueue(rq, task_of(se));
> >  		list_add(&se->group_node, &rq->cfs_tasks);
> 
> Here and...
> 
> >  	}
> >  	cfs_rq->nr_queued++;
> > @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  	update_load_sub(&cfs_rq->load, se->load.weight);
> >  	if (entity_is_task(se)) {
> >  		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> > +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
> 
> ... here, could you please check the compiler is doing CSE of task_of()?

Will consolidate those task_of(se). 

> 
> >  		list_del_init(&se->group_node);
> >  	}
> >  	cfs_rq->nr_queued--;
> > +
> > +	/* safeguard to clear the cache aware data */
> > +	if (!parent_entity(se) && !cfs_rq->nr_queued)
> > +		reset_llc_stats(rq_of(cfs_rq));
> 
> I'm confused -- why?

This was put here during early code development to make
sure things would not go haywire.  Will remove it.
A warning for the case where tasks prefer some LLC
but there are no tasks in the queue is probably
more appropriate.
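
Something along these lines, as a sketch (exact condition still to be
decided):

/*
 * Sketch only: instead of silently resetting, warn if LLC-preference
 * accounting is left over once the cfs_rq has no queued tasks.
 */
if (!parent_entity(se) && !cfs_rq->nr_queued)
        WARN_ON_ONCE(rq_of(cfs_rq)->nr_llc_running);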

Tim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 05/19] sched/fair: Add LLC index mapping for CPUs
  2025-10-15 11:58   ` Peter Zijlstra
@ 2025-10-15 20:12     ` Tim Chen
  0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-15 20:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Wed, 2025-10-15 at 13:58 +0200, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:42AM -0700, Tim Chen wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 2675db980f70..4bd033060f1d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -659,6 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> >  DEFINE_PER_CPU(int, sd_llc_size);
> >  DEFINE_PER_CPU(int, sd_llc_id);
> > +DEFINE_PER_CPU(int, sd_llc_idx);
> >  DEFINE_PER_CPU(int, sd_share_id);
> >  DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> >  DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> 
> There is literally *ONE* user of sd_llc_id, cpus_share_cache(), surely
> that can equally use sd_llc_idx?
> 
> That is to say, do we really need two numbers for this?

Okay, will look into removing sd_llc_id
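
Roughly, the one remaining user could then become (a sketch, assuming
sd_llc_idx is valid for every online CPU):

/* Sketch: compare the per-CPU LLC index instead of sd_llc_id. */
bool cpus_share_cache(int this_cpu, int that_cpu)
{
        if (this_cpu == that_cpu)
                return true;

        return per_cpu(sd_llc_idx, this_cpu) == per_cpu(sd_llc_idx, that_cpu);
}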

Tim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter
  2025-10-15 12:21   ` Peter Zijlstra
@ 2025-10-15 20:41     ` Tim Chen
  2025-10-16  7:49       ` Peter Zijlstra
  2025-10-21  8:28       ` Madadi Vineeth Reddy
  0 siblings, 2 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-15 20:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Wed, 2025-10-15 at 14:21 +0200, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:45AM -0700, Tim Chen wrote:
> > Each runqueue is assigned a static array where each element tracks
> > the number of tasks preferring a given LLC, indexed from 0 to
> > NR_LLCS.
> > 
> > For example, rq->nr_pref_llc[3] = 2 signifies that there are 2 tasks on
> > this runqueue which prefer to run within LLC3 (indexed from 0 to NR_LLCS
> > 
> > The load balancer can use this information to identify busy runqueues
> > and migrate tasks to their preferred LLC domains.
> > 
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  kernel/sched/fair.c  | 35 +++++++++++++++++++++++++++++++++++
> >  kernel/sched/sched.h |  1 +
> >  2 files changed, 36 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index fd315937c0cf..b7a68fe7601b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1235,22 +1235,51 @@ static inline int llc_idx(int cpu)
> >  	return per_cpu(sd_llc_idx, cpu);
> >  }
> >  
> > +static inline int pref_llc_idx(struct task_struct *p)
> > +{
> > +	return llc_idx(p->preferred_llc);
> > +}
> > +
> >  static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
> >  {
> > +	int pref_llc;
> > +
> >  	if (!sched_cache_enabled())
> >  		return;
> >  
> >  	rq->nr_llc_running += (p->preferred_llc != -1);
> >  	rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));
> > +
> > +	if (p->preferred_llc < 0)
> > +		return;
> > +
> > +	pref_llc = pref_llc_idx(p);
> > +	if (pref_llc < 0)
> > +		return;
> > +
> > +	++rq->nr_pref_llc[pref_llc];
> >  }
> >  
> >  static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
> >  {
> > +	int pref_llc;
> > +
> >  	if (!sched_cache_enabled())
> >  		return;
> >  
> >  	rq->nr_llc_running -= (p->preferred_llc != -1);
> >  	rq->nr_pref_llc_running -= (p->preferred_llc == task_llc(p));
> > +
> > +	if (p->preferred_llc < 0)
> > +		return;
> > +
> > +	pref_llc = pref_llc_idx(p);
> > +	if (pref_llc < 0)
> > +		return;
> > +
> > +	/* avoid negative counter */
> > +	if (rq->nr_pref_llc[pref_llc] > 0)
> > +		--rq->nr_pref_llc[pref_llc];
> 
> How!? Also, please use post increment/decrement operators.

Will change the rq->nr_pref_llc[pref_llc] <= 0 check to a warning instead,
and switch the decrement to the post-decrement operator.
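
i.e., roughly (a sketch of the intended change):

/* Sketch of the revised dequeue accounting. */
WARN_ON_ONCE(!rq->nr_pref_llc[pref_llc]);
rq->nr_pref_llc[pref_llc]--;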

> 
> >  }
> >  
> >  void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
> > @@ -1524,10 +1553,16 @@ void init_sched_mm(struct task_struct *p)
> >  
> >  void reset_llc_stats(struct rq *rq)
> >  {
> > +	int i = 0;
> > +
> >  	if (!sched_cache_enabled())
> >  		return;
> >  
> >  	rq->nr_llc_running = 0;
> > +
> > +	for (i = 0; i < max_llcs; ++i)
> > +		rq->nr_pref_llc[i] = 0;
> > +
> >  	rq->nr_pref_llc_running = 0;
> >  }
> 
> Still don't understand why this thing exists..

Will remove this, or change it to a debug
warning for the case when the rq has no fair tasks.

> 
> >  
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 3ab64067acc6..b801d32d5fba 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1101,6 +1101,7 @@ struct rq {
> >  #ifdef CONFIG_SCHED_CACHE
> >  	unsigned int		nr_pref_llc_running;
> >  	unsigned int		nr_llc_running;
> > +	unsigned int		nr_pref_llc[NR_LLCS];
> 
> Gah, yeah, lets not do this. Just (re)alloc the thing on topology
> changes or something.

Will have to think about how to keep the tasks' preferences
consistent with nr_pref_llc when the array is (re)allocated.
Perhaps make it of size NR_CPUS so we allocate it
once and never have to resize it, reallocate it, and
refill it with the right data.
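
Something like the following, as a sketch of that "allocate once, sized by
the CPU count" idea (assuming rq->nr_pref_llc becomes an unsigned int
pointer; the helper name is made up):

/*
 * Hypothetical: allocate the per-rq LLC preference counters once, sized
 * so that any possible LLC count fits (an LLC index can never exceed the
 * number of possible CPUs), so topology changes never require a resize.
 */
static int alloc_rq_nr_pref_llc(struct rq *rq)
{
        rq->nr_pref_llc = kcalloc(nr_cpu_ids, sizeof(*rq->nr_pref_llc),
                                  GFP_KERNEL);
        return rq->nr_pref_llc ? 0 : -ENOMEM;
}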

Tim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group
  2025-10-15 12:22   ` Peter Zijlstra
@ 2025-10-15 20:42     ` Tim Chen
  0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-15 20:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Wed, 2025-10-15 at 14:22 +0200, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:46AM -0700, Tim Chen wrote:
> > During LLC load balancing, tabulate the number of tasks on each runqueue
> > that prefer a given destination LLC in a sched group.
> > 
> > For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
> > balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
> > 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
> > selected as the busiest source to pick tasks from.
> > 
> > Within a source LLC, the total number of tasks preferring a destination
> > LLC is computed by summing counts across all CPUs in that runqueue. For
> > instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
> > LLC3, the total for LLC0 is 3.
> > 
> > These statistics allow the load balancer to choose tasks from source
> > sched groups that best match their preferred LLCs.
> > 
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  kernel/sched/fair.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b7a68fe7601b..cbd1e97bca4b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10399,6 +10399,9 @@ struct sg_lb_stats {
> >  	unsigned int nr_numa_running;
> >  	unsigned int nr_preferred_running;
> >  #endif
> > +#ifdef CONFIG_SCHED_CACHE
> > +	unsigned int nr_pref_llc[NR_LLCS];
> > +#endif
> >  };
> 
> Hahahaha, no! We have this on-stack, this cannot be.

Okay, will allocate it off stack.

Tim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group
  2025-10-15 12:25   ` Peter Zijlstra
@ 2025-10-15 20:43     ` Tim Chen
  0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-15 20:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Wed, 2025-10-15 at 14:25 +0200, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:46AM -0700, Tim Chen wrote:
> 
> > +#ifdef CONFIG_SCHED_CACHE
> > +		if (sched_cache_enabled()) {
> > +			int j;
> > +
> > +			for (j = 0; j < max_llcs; ++j)
> > +				sgs->nr_pref_llc[j] += rq->nr_pref_llc[j];
> 
> We live in the year 2025 and have embraced c99, please write as:
> 
> 	for (int j = 0; j < max_llcs; j++)

Will do.

> 
> > +		}
> > +#endif

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 11/19] sched/fair: Identify busiest sched_group for LLC-aware load balancing
  2025-10-15 15:24   ` Peter Zijlstra
@ 2025-10-15 21:18     ` Tim Chen
  0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-15 21:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Wed, 2025-10-15 at 17:24 +0200, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:48AM -0700, Tim Chen wrote:
> 
> > @@ -11035,6 +11059,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  	     sds->local_stat.group_type != group_has_spare))
> >  		return false;
> >  
> > +	/* deal with prefer LLC load balance, if failed, fall into normal load balance */
> > +	if (update_llc_busiest(env, busiest, sgs))
> > +		return true;
> > +
> > +	/*
> > +	 * If the busiest group has tasks with LLC preference,
> > +	 * skip normal load balance.
> > +	 */
> > +	if (busiest->group_llc_balance)
> > +		return false;
> > +
> >  	if (sgs->group_type > busiest->group_type)
> >  		return true;
> 
> This feels weird.. should we really override things like group_imbalance
> or group_misfit_task ?

It probably makes sense to move the group_llc_balance priority below group_imbalance
and group_misfit_task. Will update accordingly.
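
Roughly along these lines (a sketch; the exact placement still needs to be
worked out):

/*
 * Sketch: let the normal group_type classification (group_imbalanced,
 * group_misfit_task, ...) win first, and only consider LLC preference
 * once the group types are otherwise comparable.
 */
if (sgs->group_type > busiest->group_type)
        return true;

if (sgs->group_type < busiest->group_type)
        return false;

/* Same group_type: now try the LLC-preference balance. */
if (update_llc_busiest(env, busiest, sgs))
        return true;

if (busiest->group_llc_balance)
        return false;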

Tim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-15 15:08   ` Peter Zijlstra
@ 2025-10-15 21:28     ` Tim Chen
  0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-15 21:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Wed, 2025-10-15 at 17:08 +0200, Peter Zijlstra wrote:
> On Sat, Oct 11, 2025 at 11:24:47AM -0700, Tim Chen wrote:
> > During LLC load balancing, first check for tasks that prefer the
> > destination LLC and balance them to it before others.
> > 
> > Mark source sched groups containing tasks preferring non local LLCs
> > with the group_llc_balance flag. This ensures the load balancer later
> > pulls or pushes these tasks toward their preferred LLCs.
> > 
> > Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> 
> For me this patch is cut too fine; it only sets group_llc_balance but
> then we don't see how it is used.

Okay, will combine this patch with the following one.

> 
> >  kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 41 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index cbd1e97bca4b..af7b578eaa06 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9822,8 +9822,7 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
> >  	else
> >  		return mig_unrestricted;
> >  
> > -	return can_migrate_llc(src_cpu, dst_cpu,
> > -			       task_util(p), to_pref);
> > +	return can_migrate_llc(src_cpu, dst_cpu, task_util(p), to_pref);
> >  }
> >  
> >  #else
> > @@ -10394,6 +10393,7 @@ struct sg_lb_stats {
> >  	enum group_type group_type;
> >  	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
> >  	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
> > +	unsigned int group_llc_balance;		/* Tasks should be moved to preferred LLC */
> >  	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
> >  #ifdef CONFIG_NUMA_BALANCING
> >  	unsigned int nr_numa_running;
> > @@ -10849,11 +10849,45 @@ static void record_sg_llc_stats(struct lb_env *env,
> >  	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
> >  		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
> >  }
> > +
> > +/*
> > + * Do LLC balance on sched group that contains LLC, and have tasks preferring
> > + * to run on LLC in idle dst_cpu.
> > + */
> > +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> > +			       struct sched_group *group)
> > +{
> > +	struct sched_domain *child = env->sd->child;
> > +	int llc;
> > +
> > +	if (!sched_cache_enabled())
> > +		return false;
> > +
> > +	if (env->sd->flags & SD_SHARE_LLC)
> > +		return false;
> > +
> > +	/* only care about task migration among LLCs */
> > +	if (child && !(child->flags & SD_SHARE_LLC))
> > +		return false;
> > +
> > +	llc = llc_idx(env->dst_cpu);
> > +	if (sgs->nr_pref_llc[llc] > 0 &&
> 
> Nit: s/> 0// would be the same, right?

Sure.

> 
> > +	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
> > +		return true;
> > +
> > +	return false;
> > +}
> >  #else
> >  static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
> >  				       struct sched_group *group)
> >  {
> >  }
> > +
> > +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> > +			       struct sched_group *group)
> > +{
> > +	return false;
> > +}
> >  #endif
> >  
> >  /**
> > @@ -10954,6 +10988,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
> >  
> >  	record_sg_llc_stats(env, sgs, group);
> > +
> > +	/* Check for tasks in this group can be moved to their preferred LLC */
> > +	if (!local_group && llc_balance(env, sgs, group))
> > +		sgs->group_llc_balance = 1;
> 
> We now have 3 (or so) branches that start with:
> 
> 	if (!local_group &&
> 
> perhaps collate that some?

Sure.

> 
> > +
> >  	/* Computing avg_load makes sense only when group is overloaded */
> >  	if (sgs->group_type == group_overloaded)
> >  		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
> > -- 
> > 2.32.0
> > 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-15 19:32       ` Tim Chen
@ 2025-10-16  3:11         ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-16  3:11 UTC (permalink / raw)
  To: Tim Chen, Madadi Vineeth Reddy
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel, haoxing990

On 10/16/2025 3:32 AM, Tim Chen wrote:
> On Wed, 2025-10-15 at 12:54 +0800, Chen, Yu C wrote:
>> On 10/15/2025 3:12 AM, Madadi Vineeth Reddy wrote:
>>> On 11/10/25 23:54, Tim Chen wrote:
>>>> From: "Peter Zijlstra (Intel)" <peterz@infradead.org>

[snip]

>>>
>>> How does it handle when cache locality and memory locality conflict?
>>> Shouldn't numa preferred node get preference? Also scanning multiple
>>> nodes add overhead, so can restricting it to numa preferred node be
>>> better and scan others only when there is no numa preferred node?
>>>
>>
>> Basically, yes, you're right. Ideally, we should prioritize the NUMA
>> preferred node as the top priority. There's one case I find hard to
>> handle: the NUMA preferred node is per task rather than per process.
>> It's possible that different threads of the same process have different
>> preferred nodes; as a result, the process-wide preferred LLC could bounce
>> between different nodes, which might cause costly task migrations across
>> nodes. As a workaround, we tried to keep the scan CPU mask covering the
>> process's current preferred LLC to ensure the old preferred LLC is included
>> in the candidates. After all, we have a 2X threshold for switching the
>> preferred LLC.
> 
> If tasks in a process had different preferred nodes, they would
> belong to different numa_groups, and majority of their data would
> be from different NUMA nodes.
> 
> To resolve such conflict, we'll need to change the aggregation of tasks by
> process, to aggregation of tasks by numa_group when NUMA balancing is
> enabled.  This probably makes more sense as tasks in a numa_group
> have more shared data and would benefit from co-locating in the
> same cache.
> 

Yes, this could be an enhancement when NUMA balancing is enabled.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes
  2025-10-15 11:15     ` Peter Zijlstra
@ 2025-10-16  3:13       ` Chen, Yu C
  2025-10-17  4:50       ` Chen, Yu C
  1 sibling, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-16  3:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vern Hao, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On 10/15/2025 7:15 PM, Peter Zijlstra wrote:
> On Tue, Oct 14, 2025 at 01:16:16PM +0800, Chen, Yu C wrote:
> 
>> The question becomes: how can we figure out the threads that share
>> data? Can the kernel detect this, or get the hint from user space?
> 
> This needs the PMU, then you can steer using cache-miss ratios. But then
> people will hate us for using counters.
> 
>> Yes, the numa_group in NUMA load balancing indicates
>> that several tasks manipulate the same page, which could be an
>> indicator. Besides, if task A frequently wakes up task B, does it
>> mean A and B have the potential to share data? Furthermore, if
>> task A wakes up B via a pipe, it might also indicate that A has
>> something to share with B. I just wonder if we can introduce a
>> structure to gather this information together.
> 
> The wakeup or pipe relation might be small relative to the working set.
> Consider a sharded in memory database, where the query comes in through
> the pipe/socket/wakeup. This query is small, but then it needs to go
> trawl through its memory to find the answer.
> 
> Something we *could* look at -- later -- is an interface to create
> thread groups, such that userspace that is clever enough can communicate
> this. But then there is the ago old question, will there be sufficient
> users to justify the maintenance of said interface.

OK, that could be an enhancement in the future if we encounter more
use cases.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 18/19] sched/fair: Avoid cache-aware scheduling for memory-heavy processes
  2025-10-15  6:57   ` kernel test robot
@ 2025-10-16  4:44     ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-16  4:44 UTC (permalink / raw)
  To: kernel test robot, Tim Chen
  Cc: oe-lkp, lkp, linux-kernel, aubrey.li, Peter Zijlstra, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Libo Chen, Adam Li,
	Tim Chen

On 10/15/2025 2:57 PM, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed "UBSAN:array-index-out-of-bounds_in_drivers/base/cacheinfo.c" on:

[snip]

> [   12.549731][   T83] ------------[ cut here ]------------
> [   12.550388][   T83] UBSAN: array-index-out-of-bounds in drivers/base/cacheinfo.c:37:9
> [   12.551060][   T83] index 4294967295 is out of range for type 'unsigned long[8]'
> [   12.551580][   T83] CPU: 0 UID: 0 PID: 83 Comm: systemd-journal Not tainted 6.17.0-rc4-00035-ge8b871200f11 #1 PREEMPTLAZY
> [   12.551585][   T83] Call Trace:
> [   12.551588][   T83]  __dump_stack (lib/dump_stack.c:95)
> [   12.551594][   T83]  dump_stack_lvl (lib/dump_stack.c:123)
> [   12.551601][   T83]  ubsan_epilogue.llvm.16751680356772289369 (lib/dump_stack.c:129 lib/ubsan.c:233)
> [   12.551607][   T83]  __ubsan_handle_out_of_bounds (lib/ubsan.c:?)
> [   12.551621][   T83]  get_cpu_cacheinfo (drivers/base/cacheinfo.c:?)
> [   12.551625][   T83]  exceed_llc_capacity (include/linux/cacheinfo.h:? kernel/sched/fair.c:1256)

Thanks 0day! It seems that for some reason the assignment curr_cpu =
task_cpu(p) was placed after the check in exceed_llc_capacity(); will
fix this.
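
The fix will be roughly the following reordering (a sketch; the argument
list of exceed_llc_capacity() shown here is a guess):

        curr_cpu = task_cpu(p);                 /* initialize before any use */
        if (exceed_llc_capacity(p, curr_cpu))   /* arguments are assumed */
                return;                         /* or the existing error path */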

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 00/19] Cache Aware Scheduling
  2025-10-15 18:26       ` Madadi Vineeth Reddy
@ 2025-10-16  4:57         ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-16  4:57 UTC (permalink / raw)
  To: Madadi Vineeth Reddy
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, haoxing990

On 10/16/2025 2:26 AM, Madadi Vineeth Reddy wrote:
> On 15/10/25 11:08, Chen, Yu C wrote:
>> On 10/15/2025 5:48 AM, Tim Chen wrote:
>>> On Tue, 2025-10-14 at 17:43 +0530, Madadi Vineeth Reddy wrote:
>>>> Hi Tim,
>>>> Thanks for the patch.
>>>>
>>>> On 11/10/25 23:54, Tim Chen wrote:

[snip]

>>>> ChaCha20-xiangshan improvement in Genoa when llc_aggr_tolerance is set to 100 seems
>>>> due to having relatively lesser thread count. Please provide the numbers
>>>> with default values too. Would like to know numbers on varying loads.
>>>
>>> I'll ask Chen Yu who did the Xiangshan experiments if he has those numbers.
>>>
>>
>> Madadi, do you mean the performance score or the active thread count
>> when llc_aggr_tolerance is set to 1 (default)?
>> The score is about the same as the baseline with sched_cache and
>> llc_aggr_tolerance set to 1.
>> The active thread count is 128 per process, and there are 8 processes when
>> launching the benchmark. I suppose the 128 comes from the number
>> of online CPUs. Please let me know if you need more data.
>>
>> Cced Yangyu who's the author of this benchmark.
> 
> I mean the benchmark result with default value of llc_aggr_tolerance on Genoa
> in comparison to baseline. Knowing number of threads also helps to understand
> the impact.
> 

OK. Here are the full test script and corresponding data:
pepc pstates config --governor performance
pepc pstates config --turbo off
pepc cstates config --disable C2
echo 1 > /proc/sys/kernel/numa_balancing

echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
make run
sleep 5
sync

echo SCHED_CACHE > /sys/kernel/debug/sched/features
echo 50 > /sys/kernel/debug/sched/llc_overload_pct
echo 1 > /sys/kernel/debug/sched/llc_aggr_tolerance
make run

echo SCHED_CACHE > /sys/kernel/debug/sched/features
# to encourage task aggregation as much as possible
echo 100 > /sys/kernel/debug/sched/llc_overload_pct
echo 100 > /sys/kernel/debug/sched/llc_aggr_tolerance
make run


# ./launch.sh

Host time spent: 51,323ms //baseline
Host time spent: 51,741ms //sched_cache default
Host time spent: 27,934ms //sched_cache aggressive

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-15 16:25     ` Chen, Yu C
  2025-10-15 16:36       ` Shrikanth Hegde
@ 2025-10-16  7:40       ` Peter Zijlstra
  1 sibling, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-16  7:40 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel

On Thu, Oct 16, 2025 at 12:25:27AM +0800, Chen, Yu C wrote:
> On 10/15/2025 7:04 PM, Peter Zijlstra wrote:
> > On Sat, Oct 11, 2025 at 11:24:41AM -0700, Tim Chen wrote:
> > > From: Chen Yu <yu.c.chen@intel.com>
> > > 
> > > Enable cache-aware load balancing only if at least 1 NUMA node has
> > > more than one LLC.
> > > 
> > > Suggested-by: Libo Chen <libo.chen@oracle.com>
> > > Suggested-by: Adam Li <adamli@os.amperecomputing.com>
> > > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > ---
> > >   kernel/sched/fair.c     | 15 ++++++++++++---
> > >   kernel/sched/sched.h    |  1 +
> > >   kernel/sched/topology.c | 14 ++++++++++++--
> > >   3 files changed, 25 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index cd080468ddc9..3d643449c48c 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1208,6 +1208,14 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
> > >   __read_mostly unsigned int llc_overload_pct       = 50;
> > >   __read_mostly unsigned int llc_imb_pct            = 20;
> > > +DEFINE_STATIC_KEY_FALSE(sched_cache_allowed);
> > > +
> > > +static inline bool sched_cache_enabled(void)
> > > +{
> > > +	return sched_feat(SCHED_CACHE) &&
> > > +		static_branch_likely(&sched_cache_allowed);
> > > +}
> > 
> > Urgh; do we really need _2_ static keys stacked for this? I'm thinking
> > one should be well enough.
> 
> SCHED_CACHE allows user space to turn on/off the feature at runtime,
> while sched_cache_allow is a hardware capability. This capability is
>  disabled if there are no multiple LLCs within one node. I’m not sure
> if using one key could support the above two scenarios.

Of course it can! There is one decision 'is cache aware crap enabled',
you only need one branch for that. You just need to make the code that
sets the branch state a little more complicated.

Things like sysctl/kernel/sched_schedstats
sysctl/kernel/sched_energy_aware debugfs/sched/numa_balancing all work
like this.

Take the sched_energy_aware one, that is very similar; it will only
enable if the topology supports energy aware stuff. But then still lets
userspace disable it.

Same for this. The static branch condition should be:
'topology-has-multi-llc && userspace-wants-it'.
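
Something like this minimal sketch (variable and helper names are made up;
only the static-key API itself is real):

DEFINE_STATIC_KEY_FALSE(sched_cache_enabled_key);

static bool sched_cache_topo_ok;        /* some node has more than one LLC */
static bool sched_cache_user_on = true; /* userspace toggle */

static void sched_cache_update_key(void)
{
        if (sched_cache_topo_ok && sched_cache_user_on)
                static_branch_enable(&sched_cache_enabled_key);
        else
                static_branch_disable(&sched_cache_enabled_key);
}

static inline bool sched_cache_enabled(void)
{
        return static_branch_likely(&sched_cache_enabled_key);
}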

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-15 17:01         ` Chen, Yu C
@ 2025-10-16  7:42           ` Peter Zijlstra
  2025-10-17  2:08             ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-16  7:42 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Shrikanth Hegde, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel, Tim Chen

On Thu, Oct 16, 2025 at 01:01:05AM +0800, Chen, Yu C wrote:

> Oh, do you mean only using sched_cache_allowed in sched_cache_enabled()?
> I misunderstood that Peter suggested introducing only one key. But I didn't
> quite catch up - do you mean we should monitor the switch of FEAT, modify
> sched_cache_allowed when needed, and that the OS only queries
> sched_cache_allowed
> at runtime?

Just don't use sched_feat(), add a debugfs file like numa_balancing and
then combine the userspace and topology information when setting the one
static_branch.
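
E.g., a sketch of the debugfs side (the file name, the fops, and the
helpers that fold in the topology bit are all made up):

/* Write handler for a debugfs knob like debugfs/sched/sched_cache. */
static ssize_t sched_cache_write(struct file *file, const char __user *ubuf,
                                 size_t cnt, loff_t *ppos)
{
        bool enable;
        int err;

        err = kstrtobool_from_user(ubuf, cnt, &enable);
        if (err)
                return err;

        sched_cache_user_on = enable;
        sched_cache_update_key();       /* combines with the topology check */
        return cnt;
}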

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-15 20:03     ` Tim Chen
@ 2025-10-16  7:44       ` Peter Zijlstra
  2025-10-16 20:06         ` Tim Chen
  0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-16  7:44 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Wed, Oct 15, 2025 at 01:03:37PM -0700, Tim Chen wrote:
> On Wed, 2025-10-15 at 14:05 +0200, Peter Zijlstra wrote:
> > On Sat, Oct 11, 2025 at 11:24:44AM -0700, Tim Chen wrote:
> > > @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >  		struct rq *rq = rq_of(cfs_rq);
> > >  
> > >  		account_numa_enqueue(rq, task_of(se));
> > > +		account_llc_enqueue(rq, task_of(se));
> > >  		list_add(&se->group_node, &rq->cfs_tasks);
> > 
> > Here and...
> > 
> > >  	}
> > >  	cfs_rq->nr_queued++;
> > > @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >  	update_load_sub(&cfs_rq->load, se->load.weight);
> > >  	if (entity_is_task(se)) {
> > >  		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> > > +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
> > 
> > ... here, could you please check the compiler is doing CSE of task_of()?
> 
> Will consolidate those task_of(se). 

And rq_of(). But really, check code-gen, it *should* DTRT and CSE the
lot. If it doesn't, then do it manually.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter
  2025-10-15 20:41     ` Tim Chen
@ 2025-10-16  7:49       ` Peter Zijlstra
  2025-10-21  8:28       ` Madadi Vineeth Reddy
  1 sibling, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2025-10-16  7:49 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Wed, Oct 15, 2025 at 01:41:42PM -0700, Tim Chen wrote:

> > > +	/* avoid negative counter */
> > > +	if (rq->nr_pref_llc[pref_llc] > 0)
> > > +		--rq->nr_pref_llc[pref_llc];
> > 
> > How!? Also, please use post increment/decrement operators.
> 
> Will change the rq->nr_pref_llc[pref_llc] <= 0 to a warning instead,
> and update the decrement to post operator.

That WARN will still add code. Note how none of the nr_*_running
decrements have checks on. You fundamentally cannot remove a task from
the runqueue that hasn't first been enqueued.

If you get mis-matches something is *very* busted.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-16  7:44       ` Peter Zijlstra
@ 2025-10-16 20:06         ` Tim Chen
  0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-16 20:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Thu, 2025-10-16 at 09:44 +0200, Peter Zijlstra wrote:
> On Wed, Oct 15, 2025 at 01:03:37PM -0700, Tim Chen wrote:
> > On Wed, 2025-10-15 at 14:05 +0200, Peter Zijlstra wrote:
> > > On Sat, Oct 11, 2025 at 11:24:44AM -0700, Tim Chen wrote:
> > > > @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > > >  		struct rq *rq = rq_of(cfs_rq);
> > > >  
> > > >  		account_numa_enqueue(rq, task_of(se));
> > > > +		account_llc_enqueue(rq, task_of(se));
> > > >  		list_add(&se->group_node, &rq->cfs_tasks);
> > > 
> > > Here and...
> > > 
> > > >  	}
> > > >  	cfs_rq->nr_queued++;
> > > > @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > > >  	update_load_sub(&cfs_rq->load, se->load.weight);
> > > >  	if (entity_is_task(se)) {
> > > >  		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> > > > +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
> > > 
> > > ... here, could you please check the compiler is doing CSE of task_of()?
> > 
> > Will consolidate those task_of(se). 
> 
> And rq_of(). But really, check code-gen, it *should* DTRT and CSE the
> lot. If it doesn't, then do it manually.

Looking at the assembly dump, it does look like the compiler is doing
the right thing and not computing those twice.

Tim

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-16  7:42           ` Peter Zijlstra
@ 2025-10-17  2:08             ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-17  2:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shrikanth Hegde, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel, Tim Chen

On 10/16/2025 3:42 PM, Peter Zijlstra wrote:
> On Thu, Oct 16, 2025 at 01:01:05AM +0800, Chen, Yu C wrote:
> 
>> Oh, do you mean only using sched_cache_allowed in sched_cache_enabled()?
>> I misunderstood that Peter suggested introducing only one key. But I didn't
>> quite catch up - do you mean we should monitor the switch of FEAT, modify
>> sched_cache_allowed when needed, and that the OS only queries
>> sched_cache_allowed
>> at runtime?
> 
> Just don't use sched_feat(), add a debugfs file like numa_balancing and
> then combine the userspace and topology information when setting the one
> static_branch.

Got it, will do like this.
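
For reference, combining the userspace setting and the topology check into
a single key could look roughly like this (a sketch with assumed names, not
the actual implementation behind sched_cache_enabled()):

	DEFINE_STATIC_KEY_FALSE(sched_cache_key);

	static bool sched_cache_user;		/* toggled via the debugfs file */
	static bool sched_cache_multi_llc;	/* set from topology: more than one LLC */

	static void sched_cache_update_key(void)
	{
		if (sched_cache_user && sched_cache_multi_llc)
			static_branch_enable(&sched_cache_key);
		else
			static_branch_disable(&sched_cache_key);
	}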

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes
  2025-10-15 11:15     ` Peter Zijlstra
  2025-10-16  3:13       ` Chen, Yu C
@ 2025-10-17  4:50       ` Chen, Yu C
  2025-10-20  9:41         ` Vern Hao
  1 sibling, 1 reply; 116+ messages in thread
From: Chen, Yu C @ 2025-10-17  4:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vern Hao, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Tim Chen, linux-kernel, haoxing990

On 10/15/2025 7:15 PM, Peter Zijlstra wrote:
> On Tue, Oct 14, 2025 at 01:16:16PM +0800, Chen, Yu C wrote:
> 
>> The question becomes: how can we figure out the threads that share
>> data? Can the kernel detect this, or get the hint from user space?
> 
> This needs the PMU, then you can steer using cache-miss ratios. But then
> people will hate us for using counters.
> 
>> Yes, the numa_group in NUMA load balancing indicates
>> that several tasks manipulate the same page, which could be an
>> indicator. Besides, if task A frequently wakes up task B, does it
>> mean A and B have the potential to share data? Furthermore, if
>> task A wakes up B via a pipe, it might also indicate that A has
>> something to share with B. I just wonder if we can introduce a
>> structure to gather this information together.
> 
> The wakeup or pipe relation might be small relative to the working set.
> Consider a sharded in memory database, where the query comes in through
> the pipe/socket/wakeup. This query is small, but then it needs to go
> trawl through its memory to find the answer.
> 
> Something we *could* look at -- later -- is an interface to create
> thread groups, such that userspace that is clever enough can communicate
> this. But then there is the age-old question: will there be sufficient
> users to justify the maintenance of said interface.

I did not intend to digress too far, but since this issue has been brought
up, a wild guess came to me - could the "interface to create thread groups"
here refer to something like the filesystem for memory cgroup
v2 thread mode? I just heard that some cloud users might split the threads
of a single process into different thread groups, where threads within each
group share data with one another (for example, when performing K-V hashing
operations). Using cgroup for this purpose might be a bit overkill, though,
considering that cgroup itself is designed for resource partitioning rather
than identifying tasks sharing data. Meanwhile, the hierarchy of cgroup
could also cause some overhead. If there were a single-layer thread
partitioning mechanism - similar to the resctrl filesystem - wouldn’t that
allow us to avoid modifying too much user business code while minimizing
coupling with existing kernel components?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes
  2025-10-17  4:50       ` Chen, Yu C
@ 2025-10-20  9:41         ` Vern Hao
  0 siblings, 0 replies; 116+ messages in thread
From: Vern Hao @ 2025-10-20  9:41 UTC (permalink / raw)
  To: Chen, Yu C, Peter Zijlstra
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vern Hao, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Tim Chen, linux-kernel


On 2025/10/17 12:50, Chen, Yu C wrote:
> On 10/15/2025 7:15 PM, Peter Zijlstra wrote:
>> On Tue, Oct 14, 2025 at 01:16:16PM +0800, Chen, Yu C wrote:
>>
>>> The question becomes: how can we figure out the threads that share
>>> data? Can the kernel detect this, or get the hint from user space?
>>
>> This needs the PMU, then you can steer using cache-miss ratios. But then
>> people will hate us for using counters.
>>
>>> Yes, the numa_group in NUMA load balancing indicates
>>> that several tasks manipulate the same page, which could be an
>>> indicator. Besides, if task A frequently wakes up task B, does it
>>> mean A and B have the potential to share data? Furthermore, if
>>> task A wakes up B via a pipe, it might also indicate that A has
>>> something to share with B. I just wonder if we can introduce a
>>> structure to gather this information together.
>>
>> The wakeup or pipe relation might be small relative to the working set.
>> Consider a sharded in memory database, where the query comes in through
>> the pipe/socket/wakeup. This query is small, but then it needs to go
>> trawl through its memory to find the answer.
>>
>> Something we *could* look at -- later -- is an interface to create
>> thread groups, such that userspace that is clever enough can communicate
>> this. But then there is the age-old question: will there be sufficient
>> users to justify the maintenance of said interface.
>
> I did not intend to digress too far, but since this issue has been brought
> up, a wild guess came to me - could the "interface to create thread groups"
> here refer to something like the filesystem for memory cgroup
> v2 thread mode? I just heard that some cloud users might split the threads
> of a single process into different thread groups, where threads within each
> group share data with one another (for example, when performing K-V hashing
> operations).

Yes, in our internal business, we encountered similar issues. The actual
scenario is on AMD virtual machines, where businesses would spawn multiple
concurrent threads, for example, around 900 threads, with over 600 threads
handling hash or key-value computations, more than 100 threads dealing with
network transmission, and some others handling background logging or
monitoring. These threads do not share the same hot L3 cache data, so
concentrating them would only exacerbate contention.

Can we differentiate these types of threads? It's obvious that the current
configuration approach cannot meet the requirements and will only cause more
L3 cache contention. Can we use cgroup or other methods, or configure
through system calls to make the distinction (the application may not be
willing to modify the code)?

> Using cgroup for this purpose might be a bit overkill, though,
> considering that cgroup itself is designed for resource partitioning rather
> than identifying tasks sharing data. Meanwhile, the hierarchy of cgroup
> could also cause some overhead. If there were a single-layer thread
> partitioning mechanism - similar to the resctrl filesystem - wouldn’t that
> allow us to avoid modifying too much user business code while minimizing
> coupling with existing kernel components?



> thanks,
> Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter
  2025-10-15 20:41     ` Tim Chen
  2025-10-16  7:49       ` Peter Zijlstra
@ 2025-10-21  8:28       ` Madadi Vineeth Reddy
  2025-10-23  6:07         ` Chen, Yu C
  1 sibling, 1 reply; 116+ messages in thread
From: Madadi Vineeth Reddy @ 2025-10-21  8:28 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Chen Yu, Libo Chen,
	Adam Li, Tim Chen, linux-kernel, Madadi Vineeth Reddy

On 16/10/25 02:11, Tim Chen wrote:
> On Wed, 2025-10-15 at 14:21 +0200, Peter Zijlstra wrote:
>> On Sat, Oct 11, 2025 at 11:24:45AM -0700, Tim Chen wrote:
>>> Each runqueue is assigned a static array where each element tracks
>>> the number of tasks preferring a given LLC, indexed from 0 to
>>> NR_LLCS.
>>>

[snip]

>>
>>>  
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 3ab64067acc6..b801d32d5fba 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1101,6 +1101,7 @@ struct rq {
>>>  #ifdef CONFIG_SCHED_CACHE
>>>  	unsigned int		nr_pref_llc_running;
>>>  	unsigned int		nr_llc_running;
>>> +	unsigned int		nr_pref_llc[NR_LLCS];
>>
>> Gah, yeah, lets not do this. Just (re)alloc the thing on topology
>> changes or something.
> 
> Will have to think about how to keep the tasks' preference
> consistent with nr_pref_llc with the new array.  Perhaps
> make it size of NR_CPUS so we will allocate
> once and don't have to resize and reallocate it, and
> fill it back up with the right data.
> 
> Tim

IIUC, what Peter meant is to dynamically allocate the array size based on
the actual number of LLCs computed in build_sched_domains() or somesuch
rather than statically allocating NR_LLCS which is 64 by default.

Making it NR_CPUS would be even larger and waste more memory on systems
with few LLCs.

Thanks,
Madadi Vineeth Reddy

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 17/19] sched/fair: Disable cache aware scheduling for processes with high thread counts
  2025-10-11 18:24 ` [PATCH 17/19] sched/fair: Disable cache aware scheduling for processes with high thread counts Tim Chen
@ 2025-10-22 17:21   ` Madadi Vineeth Reddy
  2025-10-23  6:55     ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: Madadi Vineeth Reddy @ 2025-10-22 17:21 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Libo Chen, Adam Li, Tim Chen, linux-kernel, Madadi Vineeth Reddy

On 11/10/25 23:54, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
> 
> If the number of active threads within the process
> exceeds the number of Cores(divided by SMTs number)
> in the LLC, do not enable cache-aware scheduling.
> This is because there is a risk of cache contention
> within the preferred LLC when too many threads are
> present.
> 
> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 79d109f8a09f..6b8eace79eee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1240,6 +1240,18 @@ static inline int pref_llc_idx(struct task_struct *p)
>  	return llc_idx(p->preferred_llc);
>  }
>  
> +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
> +{
> +	int smt_nr = 1;
> +
> +#ifdef CONFIG_SCHED_SMT
> +	if (sched_smt_active())
> +		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
> +#endif
> +
> +	return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));

On Power10 and Power11, which have SMT8 and an LLC size of 4, this would
disable cache aware scheduling even for a single thread (1 * 8 > 4).

Also, llc_overload_pct already ensures the load on the preferred LLC doesn't
exceed a certain capacity. Why is this exceed_llc_nr() check needed? Won't the
existing overload_pct naturally prevent excessive task aggregation by blocking
migrations when the destination LLC reaches ~50% utilization?

Thanks,
Madadi Vineeth Reddy

> +}
> +
>  static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
>  {
>  	int pref_llc;
> @@ -1385,10 +1397,12 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>  
>  	/*
>  	 * If this task hasn't hit task_cache_work() for a while, or it
> -	 * has only 1 thread, invalidate its preferred state.
> +	 * has only 1 thread, or has too many active threads, invalidate
> +	 * its preferred state.
>  	 */
>  	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
> -	    get_nr_threads(p) <= 1) {
> +	    get_nr_threads(p) <= 1 ||
> +	    exceed_llc_nr(mm, cpu_of(rq))) {
>  		if (mm->mm_sched_cpu != -1)
>  			mm->mm_sched_cpu = -1;
>  	}
> @@ -1467,6 +1481,11 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  	if (p->flags & PF_EXITING)
>  		return;
>  
> +	if (get_nr_threads(p) <= 1) {
> +		mm->mm_sched_cpu = -1;
> +		return;
> +	}
> +
>  	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
>  		return;
>  
> @@ -9826,6 +9845,10 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
>  	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
>  		return mig_unrestricted;
>  
> +	 /* skip cache aware load balance for single/too many threads */
> +	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
> +		return mig_unrestricted;
> +
>  	if (cpus_share_cache(dst_cpu, cpu))
>  		to_pref = true;
>  	else if (cpus_share_cache(src_cpu, cpu))


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter
  2025-10-21  8:28       ` Madadi Vineeth Reddy
@ 2025-10-23  6:07         ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-23  6:07 UTC (permalink / raw)
  To: Madadi Vineeth Reddy, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel, haoxing990

On 10/21/2025 4:28 PM, Madadi Vineeth Reddy wrote:
> On 16/10/25 02:11, Tim Chen wrote:
>> On Wed, 2025-10-15 at 14:21 +0200, Peter Zijlstra wrote:
>>> On Sat, Oct 11, 2025 at 11:24:45AM -0700, Tim Chen wrote:
>>>> Each runqueue is assigned a static array where each element tracks
>>>> the number of tasks preferring a given LLC, indexed from 0 to
>>>> NR_LLCS.
>>>>
> 
> [snip]
> 
>>>
>>>>   
>>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>>> index 3ab64067acc6..b801d32d5fba 100644
>>>> --- a/kernel/sched/sched.h
>>>> +++ b/kernel/sched/sched.h
>>>> @@ -1101,6 +1101,7 @@ struct rq {
>>>>   #ifdef CONFIG_SCHED_CACHE
>>>>   	unsigned int		nr_pref_llc_running;
>>>>   	unsigned int		nr_llc_running;
>>>> +	unsigned int		nr_pref_llc[NR_LLCS];
>>>
>>> Gah, yeah, lets not do this. Just (re)alloc the thing on topology
>>> changes or something.
>>
>> Will have to think about how to keep the tasks' preference
>> consistent with nr_pref_llc with the new array.  Perhaps
>> make it size of NR_CPUS so we will allocate
>> once and don't have to resize and reallocate it, and
>> fill it back up with the right data.
>>
>> Tim
> 
> IIUC, what Peter meant is to dynamically allocate the array size based on
> the actual number of LLCs computed in build_sched_domains() or somesuch
> rather than statically allocating NR_LLCS which is 64 by default.
> 

OK, this might involve dynamic adjustment/data synchronization of the
nr_pref_llc buffer during CPU hotplug, and we'll make some modifications 
to it.
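
As a rough sketch of that direction (assuming nr_pref_llc becomes a pointer
sized by the LLC count computed at domain-build time; the names below are
illustrative only):

	/* called when sched domains are (re)built, nr_llcs from topology */
	static int rq_realloc_nr_pref_llc(struct rq *rq, int nr_llcs)
	{
		unsigned int *new;

		new = kcalloc(nr_llcs, sizeof(*new), GFP_KERNEL);
		if (!new)
			return -ENOMEM;

		/* caller is expected to refill the counts afterwards */
		kfree(rq->nr_pref_llc);
		rq->nr_pref_llc = new;
		return 0;
	}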

thanks,
Chenyu

> Making it NR_CPUS would be even larger and waste more memory on systems
> with few LLCs.
> 
> Thanks,
> Madadi Vineeth Reddy

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 17/19] sched/fair: Disable cache aware scheduling for processes with high thread counts
  2025-10-22 17:21   ` Madadi Vineeth Reddy
@ 2025-10-23  6:55     ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-23  6:55 UTC (permalink / raw)
  To: Madadi Vineeth Reddy, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Hillf Danton,
	Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
	Len Brown, Aubrey Li, Zhao Liu, Chen Yu, Adam Li, Tim Chen,
	linux-kernel, haoxing990

On 10/23/2025 1:21 AM, Madadi Vineeth Reddy wrote:
> On 11/10/25 23:54, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> If the number of active threads within the process
>> exceeds the number of Cores(divided by SMTs number)
>> in the LLC, do not enable cache-aware scheduling.
>> This is because there is a risk of cache contention
>> within the preferred LLC when too many threads are
>> present.
>>
>> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>   kernel/sched/fair.c | 27 +++++++++++++++++++++++++--
>>   1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 79d109f8a09f..6b8eace79eee 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1240,6 +1240,18 @@ static inline int pref_llc_idx(struct task_struct *p)
>>   	return llc_idx(p->preferred_llc);
>>   }
>>   
>> +static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>> +{
>> +	int smt_nr = 1;
>> +
>> +#ifdef CONFIG_SCHED_SMT
>> +	if (sched_smt_active())
>> +		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>> +#endif
>> +
>> +	return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
> 
> In Power10 and Power11 that has SMT8 and LLC size of 4, this would disable
> cache aware scheduling even for one thread.
> 

Using smt_nr was mainly due to concerns about introducing regressions
on Power, as discussed in v3
https://lore.kernel.org/all/8f6c7c69-b6b3-4c82-8db3-96757f09245f@linux.ibm.com/
and
https://lore.kernel.org/all/ddb9d558-d114-41db-9d4b-296fc2ecdbb4@linux.ibm.com/

It seems that aggregating tasks on an LLC with many SMT threads or a
smaller LLC size would pose a risk of cache contention. Additionally,
with patch [19/19], users can tune
/sys/kernel/debug/sched/llc_aggr_tolerance to adjust the threshold:

return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));

> Also, llc_overload_pct already ensures the load on the  preferred LLC doesn't
> exceed certain capacity. Why is this exceed_llc_nr() check needed? Won't the
> existing overload_pct naturally prevent excessive task aggregation by blocking
> migrations when the destination LLC reaches ~50% utilization?
> 

Using exceed_llc_nr() was because some short-duration tasks, such as
schbench, could generate low utilization but still cause cache contention
(for some reason, util_avg cannot track that properly). Therefore, we
inhibit task aggregation when a process has a large number of active
threads.


thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 16/19] sched/fair: Exclude processes with many threads from cache-aware scheduling
  2025-10-11 18:24 ` [PATCH 16/19] sched/fair: Exclude processes with many threads from cache-aware scheduling Tim Chen
@ 2025-10-23  7:22   ` kernel test robot
  0 siblings, 0 replies; 116+ messages in thread
From: kernel test robot @ 2025-10-23  7:22 UTC (permalink / raw)
  To: Tim Chen
  Cc: oe-lkp, lkp, K Prateek Nayak, Tim Chen, linux-mm, linux-kernel,
	aubrey.li, yu.c.chen, Peter Zijlstra, Ingo Molnar,
	Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen, oliver.sang



Hello,

kernel test robot noticed a 2.1% regression of will-it-scale.per_thread_ops on:


commit: cb57b28051ef1d84e7cb14db4e1ab99b4f33b4b5 ("[PATCH 16/19] sched/fair: Exclude processes with many threads from cache-aware scheduling")
url: https://github.com/intel-lab-lkp/linux/commits/Tim-Chen/sched-fair-Add-infrastructure-for-cache-aware-load-balancing/20251012-022248
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 45b7f780739a3145aeef24d2dfa02517a6c82ed6
patch link: https://lore.kernel.org/all/637cdb8ab11b1b978d697ed744cc402d32443ecc.1760206683.git.tim.c.chen@linux.intel.com/
patch subject: [PATCH 16/19] sched/fair: Exclude processes with many threads from cache-aware scheduling

testcase: will-it-scale
config: x86_64-rhel-9.4
compiler: gcc-14
test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 64G memory
parameters:

	nr_task: 100%
	mode: thread
	test: tlb_flush2
	cpufreq_governor: performance




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202510231406.30bc8aec-lkp@intel.com


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251023/202510231406.30bc8aec-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-14/performance/x86_64-rhel-9.4/thread/100%/debian-13-x86_64-20250902.cgz/lkp-ivb-2ep2/tlb_flush2/will-it-scale

commit: 
  4ac141e433 ("sched/fair: Respect LLC preference in task migration and detach")
  cb57b28051 ("sched/fair: Exclude processes with many threads from cache-aware scheduling")

4ac141e4330723c0 cb57b28051ef1d84e7cb14db4e1 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   1482496            -2.1%    1451299        will-it-scale.48.threads
     30884            -2.1%      30235        will-it-scale.per_thread_ops
   1482496            -2.1%    1451299        will-it-scale.workload
 4.447e+08            -2.1%  4.355e+08        proc-vmstat.numa_hit
 4.447e+08            -2.1%  4.355e+08        proc-vmstat.numa_local
 4.447e+08            -2.1%  4.354e+08        proc-vmstat.pgalloc_normal
 8.884e+08            -2.1%  8.698e+08        proc-vmstat.pgfault
 4.446e+08            -2.1%  4.353e+08        proc-vmstat.pgfree
 6.446e+09            -2.0%  6.318e+09        perf-stat.i.branch-instructions
 1.462e+08            -1.4%  1.441e+08        perf-stat.i.branch-misses
 1.467e+08            -1.6%  1.444e+08        perf-stat.i.cache-misses
 7.692e+08            -1.4%  7.587e+08        perf-stat.i.cache-references
    101348 ±  2%      +3.6%     104965        perf-stat.i.context-switches
      4.14            +1.9%       4.22        perf-stat.i.cpi
    883.41            +1.4%     896.20        perf-stat.i.cycles-between-cache-misses
 3.083e+10            -2.0%  3.022e+10        perf-stat.i.instructions
      0.24            -1.8%       0.24        perf-stat.i.ipc
    124.71            -2.0%     122.18        perf-stat.i.metric.K/sec
   2944589            -2.1%    2882055        perf-stat.i.minor-faults
   2944589            -2.1%    2882055        perf-stat.i.page-faults
      4.17            +1.9%       4.25        perf-stat.overall.cpi
    876.76            +1.5%     889.96        perf-stat.overall.cycles-between-cache-misses
      0.24            -1.8%       0.24        perf-stat.overall.ipc
 6.417e+09            -2.0%   6.29e+09        perf-stat.ps.branch-instructions
 1.455e+08            -1.4%  1.434e+08        perf-stat.ps.branch-misses
  1.46e+08            -1.6%  1.436e+08        perf-stat.ps.cache-misses
 7.653e+08            -1.4%  7.549e+08        perf-stat.ps.cache-references
    100692 ±  2%      +3.6%     104309        perf-stat.ps.context-switches
 3.069e+10            -2.0%  3.008e+10        perf-stat.ps.instructions
   2931887            -2.1%    2869944        perf-stat.ps.minor-faults
   2931887            -2.1%    2869944        perf-stat.ps.page-faults
 9.273e+12            -1.9%  9.096e+12        perf-stat.total.instructions
     62.03            -1.8       60.18        perf-profile.calltrace.cycles-pp.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.do_madvise.__x64_sys_madvise
     63.66            -1.8       61.82        perf-profile.calltrace.cycles-pp.flush_tlb_mm_range.tlb_finish_mmu.do_madvise.__x64_sys_madvise.do_syscall_64
     61.19            -1.8       59.36        perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu.do_madvise
     65.49            -1.7       63.79        perf-profile.calltrace.cycles-pp.tlb_finish_mmu.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe
     75.54            -1.5       74.02        perf-profile.calltrace.cycles-pp.__madvise
     71.89            -1.5       70.41        perf-profile.calltrace.cycles-pp.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
     72.40            -1.5       70.92        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__madvise
     71.83            -1.5       70.35        perf-profile.calltrace.cycles-pp.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
     72.35            -1.5       70.87        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__madvise
     15.31            -0.6       14.70        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu
     12.04            -0.5       11.52        perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
     10.97            -0.5       10.47        perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond
     11.08            -0.5       10.58        perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask
      4.36            -0.2        4.15        perf-profile.calltrace.cycles-pp.flush_tlb_func.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
      4.53            -0.2        4.34        perf-profile.calltrace.cycles-pp.llist_reverse_order.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
      2.02 ±  2%      -0.1        1.95        perf-profile.calltrace.cycles-pp.lock_vma_under_rcu.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
      1.47            -0.1        1.42        perf-profile.calltrace.cycles-pp.folio_add_lru.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
      0.83 ±  2%      -0.0        0.80        perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
      0.65 ±  2%      -0.0        0.62        perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.do_user_addr_fault.exc_page_fault
      0.73 ±  2%      -0.0        0.70        perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      0.64 ±  2%      -0.0        0.61 ±  2%  perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.do_user_addr_fault
      0.73            -0.0        0.71        perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.testcase
      0.84            -0.0        0.82        perf-profile.calltrace.cycles-pp.clear_page_erms.prep_new_page.get_page_from_freelist.__alloc_frozen_pages_noprof.alloc_pages_mpol
      1.65            +0.0        1.68        perf-profile.calltrace.cycles-pp.__alloc_frozen_pages_noprof.alloc_pages_mpol.vma_alloc_folio_noprof.alloc_anon_folio.do_anonymous_page
      1.92            +0.0        1.96        perf-profile.calltrace.cycles-pp.vma_alloc_folio_noprof.alloc_anon_folio.do_anonymous_page.__handle_mm_fault.handle_mm_fault
      1.79            +0.0        1.83        perf-profile.calltrace.cycles-pp.alloc_pages_mpol.vma_alloc_folio_noprof.alloc_anon_folio.do_anonymous_page.__handle_mm_fault
      0.92 ±  3%      +0.1        1.04        perf-profile.calltrace.cycles-pp.tlb_gather_mmu.do_madvise.__x64_sys_madvise.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.83 ±  6%      +0.2        3.04 ±  2%  perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      7.06            +0.4        7.48        perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      6.25            +0.4        6.70        perf-profile.calltrace.cycles-pp.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
      2.48            +0.5        3.02        perf-profile.calltrace.cycles-pp.alloc_anon_folio.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
      0.00            +0.7        0.74 ±  5%  perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.alloc_anon_folio.do_anonymous_page.__handle_mm_fault
      0.00            +0.9        0.94 ±  4%  perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.alloc_anon_folio.do_anonymous_page.__handle_mm_fault.handle_mm_fault
     19.08            +1.3       20.36        perf-profile.calltrace.cycles-pp.testcase
     14.17            +1.3       15.46        perf-profile.calltrace.cycles-pp.asm_exc_page_fault.testcase
     12.74            +1.4       14.10        perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault.testcase
     12.49            +1.4       13.85        perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
      7.97            +1.4        9.38        perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.testcase
     62.25            -1.9       60.39        perf-profile.children.cycles-pp.smp_call_function_many_cond
     62.26            -1.9       60.40        perf-profile.children.cycles-pp.on_each_cpu_cond_mask
     63.94            -1.8       62.08        perf-profile.children.cycles-pp.flush_tlb_mm_range
     65.76            -1.7       64.06        perf-profile.children.cycles-pp.tlb_finish_mmu
     75.72            -1.5       74.19        perf-profile.children.cycles-pp.__madvise
     73.51            -1.5       72.02        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     73.48            -1.5       71.99        perf-profile.children.cycles-pp.do_syscall_64
     71.90            -1.5       70.41        perf-profile.children.cycles-pp.__x64_sys_madvise
     71.85            -1.5       70.36        perf-profile.children.cycles-pp.do_madvise
     18.46            -0.3       18.12        perf-profile.children.cycles-pp.__flush_smp_call_function_queue
     19.79            -0.3       19.47        perf-profile.children.cycles-pp.sysvec_call_function
     22.68            -0.3       22.36        perf-profile.children.cycles-pp.asm_sysvec_call_function
     17.97            -0.3       17.65        perf-profile.children.cycles-pp.__sysvec_call_function
      7.64            -0.2        7.46        perf-profile.children.cycles-pp.flush_tlb_func
      7.49            -0.1        7.39        perf-profile.children.cycles-pp.llist_reverse_order
      1.47            -0.1        1.42        perf-profile.children.cycles-pp.folio_add_lru
      1.99            -0.0        1.94        perf-profile.children.cycles-pp.__pte_offset_map_lock
      1.84            -0.0        1.80        perf-profile.children.cycles-pp._raw_spin_lock
      1.31            -0.0        1.27        perf-profile.children.cycles-pp.folio_batch_move_lru
      0.93            -0.0        0.90        perf-profile.children.cycles-pp.error_entry
      0.42            -0.0        0.40        perf-profile.children.cycles-pp.vms_clear_ptes
      0.89            -0.0        0.87        perf-profile.children.cycles-pp.clear_page_erms
      0.94            -0.0        0.92        perf-profile.children.cycles-pp.prep_new_page
      1.66            +0.0        1.69        perf-profile.children.cycles-pp.__alloc_frozen_pages_noprof
      1.80            +0.0        1.84        perf-profile.children.cycles-pp.alloc_pages_mpol
      0.00            +0.1        0.05        perf-profile.children.cycles-pp.__pi_memset
      0.96 ±  3%      +0.1        1.08        perf-profile.children.cycles-pp.tlb_gather_mmu
      2.90 ±  6%      +0.2        3.10        perf-profile.children.cycles-pp.intel_idle
      3.23 ±  5%      +0.2        3.45        perf-profile.children.cycles-pp.cpuidle_enter
      3.32 ±  5%      +0.2        3.54        perf-profile.children.cycles-pp.cpuidle_idle_call
      7.09            +0.4        7.51        perf-profile.children.cycles-pp.__handle_mm_fault
      6.27            +0.4        6.72        perf-profile.children.cycles-pp.do_anonymous_page
      0.43 ±  6%      +0.5        0.94 ±  4%  perf-profile.children.cycles-pp.__mem_cgroup_charge
      0.25 ± 11%      +0.5        0.76 ±  5%  perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
      2.49            +0.5        3.03        perf-profile.children.cycles-pp.alloc_anon_folio
     19.67            +1.3       20.94        perf-profile.children.cycles-pp.testcase
     14.47            +1.3       15.77        perf-profile.children.cycles-pp.asm_exc_page_fault
     12.76            +1.4       14.12        perf-profile.children.cycles-pp.exc_page_fault
     12.60            +1.4       13.96        perf-profile.children.cycles-pp.do_user_addr_fault
      7.99            +1.5        9.44        perf-profile.children.cycles-pp.handle_mm_fault
     42.94            -1.2       41.71        perf-profile.self.cycles-pp.smp_call_function_many_cond
      6.02            -0.2        5.87        perf-profile.self.cycles-pp.flush_tlb_func
      7.46            -0.1        7.36        perf-profile.self.cycles-pp.llist_reverse_order
      1.44            -0.0        1.40        perf-profile.self.cycles-pp.lock_vma_under_rcu
      0.88            -0.0        0.85        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.91            -0.0        0.88        perf-profile.self.cycles-pp.error_entry
      0.07            +0.0        0.08 ±  4%  perf-profile.self.cycles-pp.get_page_from_freelist
      0.76 ±  2%      +0.1        0.86        perf-profile.self.cycles-pp.tlb_gather_mmu
      1.10 ±  5%      +0.1        1.24 ±  2%  perf-profile.self.cycles-pp.tlb_finish_mmu
      2.90 ±  6%      +0.2        3.10        perf-profile.self.cycles-pp.intel_idle
      0.20 ± 10%      +0.4        0.62 ±  6%  perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
      0.15 ±  4%      +0.8        1.00 ±  3%  perf-profile.self.cycles-pp.handle_mm_fault




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-11 18:24 ` [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing Tim Chen
  2025-10-14 19:12   ` Madadi Vineeth Reddy
@ 2025-10-23  7:26   ` kernel test robot
  2025-10-27  4:47   ` K Prateek Nayak
  2 siblings, 0 replies; 116+ messages in thread
From: kernel test robot @ 2025-10-23  7:26 UTC (permalink / raw)
  To: Tim Chen
  Cc: oe-lkp, lkp, Chen Yu, Tim Chen, linux-mm, linux-kernel, aubrey.li,
	Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Libo Chen, Adam Li, Tim Chen, oliver.sang



Hello,

kernel test robot noticed a 5.1% improvement of will-it-scale.per_thread_ops on:


commit: ddf7df94672b42db9a86b3225cf9ebcfdfefc506 ("[PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing")
url: https://github.com/intel-lab-lkp/linux/commits/Tim-Chen/sched-fair-Add-infrastructure-for-cache-aware-load-balancing/20251012-022248
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 45b7f780739a3145aeef24d2dfa02517a6c82ed6
patch link: https://lore.kernel.org/all/865b852e3fdef6561c9e0a5be9a94aec8a68cdea.1760206683.git.tim.c.chen@linux.intel.com/
patch subject: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing

testcase: will-it-scale
config: x86_64-rhel-9.4
compiler: gcc-14
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_task: 100%
	mode: thread
	test: mmap1
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251023/202510231459.ad690ecd-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-14/performance/x86_64-rhel-9.4/thread/100%/debian-13-x86_64-20250902.cgz/lkp-icl-2sp7/mmap1/will-it-scale

commit: 
  45b7f78073 ("sched: Fix some typos in include/linux/preempt.h")
  ddf7df9467 ("sched/fair: Add infrastructure for cache-aware load balancing")

45b7f780739a3145 ddf7df94672b42db9a86b3225cf 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     12.30 ±  4%      +1.2       13.54 ±  2%  turbostat.C6%
     16510 ±  2%     +11.0%      18333 ±  5%  perf-c2c.HITM.local
     14007 ±  2%     +11.9%      15670 ±  5%  perf-c2c.HITM.remote
     30518 ±  2%     +11.4%      34003 ±  5%  perf-c2c.HITM.total
      4216 ±117%     -82.9%     720.83 ± 87%  sched_debug.cfs_rq:/.load_avg.max
    230030 ±  6%     +21.9%     280349 ±  3%  sched_debug.cpu.nr_switches.min
  32696287          +100.0%   65398751        sched_debug.sysctl_sched.sysctl_sched_features
     71075            +5.1%      74710        will-it-scale.64.threads
      1109            +5.1%       1166        will-it-scale.per_thread_ops
     71075            +5.1%      74710        will-it-scale.workload
  20012272            +1.8%   20374679        perf-stat.i.branch-misses
  20662368            +4.4%   21568525        perf-stat.i.cache-references
    181255 ±  6%     +10.5%     200376 ±  2%  perf-stat.i.context-switches
    177.39            +5.8%     187.65        perf-stat.i.cpu-migrations
     22226 ±  3%      -5.9%      20924 ±  2%  perf-stat.i.cycles-between-cache-misses
      2.83 ±  6%     +10.5%       3.13 ±  2%  perf-stat.i.metric.K/sec
      0.26            +0.0        0.27        perf-stat.overall.branch-miss-rate%
 1.609e+08            -5.6%  1.518e+08        perf-stat.overall.path-length
  19825311            +1.9%   20209803        perf-stat.ps.branch-misses
  20712472            +4.4%   21614327        perf-stat.ps.cache-references
    180352 ±  6%     +10.5%     199276 ±  2%  perf-stat.ps.context-switches
    177.05            +5.8%     187.32        perf-stat.ps.cpu-migrations
     47.47            -0.1       47.36        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
     48.49            -0.1       48.40        perf-profile.calltrace.cycles-pp.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
     48.39            -0.1       48.30        perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
     47.36            -0.1       47.27        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.do_syscall_64
     48.44            -0.1       48.35        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64
     48.34            -0.1       48.25        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
     49.70            -0.1       49.62        perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     49.70            -0.1       49.62        perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     49.71            -0.1       49.64        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     49.73            -0.1       49.66        perf-profile.calltrace.cycles-pp.__munmap
     49.71            -0.1       49.64        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
     49.11            -0.1       49.06        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
      0.52            +0.0        0.54        perf-profile.calltrace.cycles-pp.vms_gather_munmap_vmas.do_vmi_align_munmap.do_vmi_munmap.__vm_munmap.__x64_sys_munmap
      0.54 ±  2%      +0.0        0.56        perf-profile.calltrace.cycles-pp.do_mmap.vm_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
      0.67 ±  3%      +0.1        0.74 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
      0.68 ±  3%      +0.1        0.74 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
      0.71 ±  3%      +0.1        0.78 ±  2%  perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.common_startup_64
      0.99 ±  3%      +0.1        1.10 ±  2%  perf-profile.calltrace.cycles-pp.common_startup_64
      0.97 ±  3%      +0.1        1.08 ±  2%  perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.common_startup_64
      0.97 ±  3%      +0.1        1.08 ±  2%  perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.common_startup_64
      0.97 ±  3%      +0.1        1.08 ±  2%  perf-profile.calltrace.cycles-pp.start_secondary.common_startup_64
     94.85            -0.2       94.65        perf-profile.children.cycles-pp.osq_lock
     96.81            -0.2       96.64        perf-profile.children.cycles-pp.rwsem_down_write_slowpath
     96.87            -0.2       96.70        perf-profile.children.cycles-pp.down_write_killable
     98.91            -0.1       98.80        perf-profile.children.cycles-pp.do_syscall_64
     98.91            -0.1       98.80        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     49.70            -0.1       49.62        perf-profile.children.cycles-pp.__x64_sys_munmap
     49.70            -0.1       49.62        perf-profile.children.cycles-pp.__vm_munmap
     49.73            -0.1       49.66        perf-profile.children.cycles-pp.__munmap
      0.05            +0.0        0.06        perf-profile.children.cycles-pp.wake_q_add
      0.23            +0.0        0.24        perf-profile.children.cycles-pp.kmem_cache_free
      0.15 ±  2%      +0.0        0.16        perf-profile.children.cycles-pp.anon_vma_clone
      0.07            +0.0        0.08 ±  4%  perf-profile.children.cycles-pp.ttwu_queue_wakelist
      0.07 ±  5%      +0.0        0.08 ±  5%  perf-profile.children.cycles-pp.pick_next_task_fair
      0.08 ±  4%      +0.0        0.09 ±  5%  perf-profile.children.cycles-pp.__pick_next_task
      0.07            +0.0        0.08 ±  5%  perf-profile.children.cycles-pp.raw_spin_rq_lock_nested
      0.22            +0.0        0.24 ±  3%  perf-profile.children.cycles-pp.vma_expand
      0.09 ±  4%      +0.0        0.10 ±  4%  perf-profile.children.cycles-pp.dequeue_task_fair
      0.10 ±  3%      +0.0        0.12 ±  3%  perf-profile.children.cycles-pp.schedule_idle
      0.08 ±  4%      +0.0        0.10 ±  3%  perf-profile.children.cycles-pp.dequeue_entity
      0.09 ±  4%      +0.0        0.11 ±  4%  perf-profile.children.cycles-pp.try_to_block_task
      0.45            +0.0        0.47        perf-profile.children.cycles-pp.__split_vma
      0.08 ±  5%      +0.0        0.10 ±  4%  perf-profile.children.cycles-pp.dequeue_entities
      0.16 ±  2%      +0.0        0.18 ±  4%  perf-profile.children.cycles-pp.commit_merge
      0.16            +0.0        0.18 ±  5%  perf-profile.children.cycles-pp.vma_complete
      0.12 ±  7%      +0.0        0.14 ±  3%  perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      0.25            +0.0        0.27 ±  3%  perf-profile.children.cycles-pp.vma_merge_new_range
      0.54            +0.0        0.56        perf-profile.children.cycles-pp.do_mmap
      0.08 ±  5%      +0.0        0.11 ±  4%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
      0.05 ±  7%      +0.0        0.08 ±  4%  perf-profile.children.cycles-pp.update_curr
      0.04 ± 44%      +0.0        0.08 ±  6%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      0.41            +0.0        0.44 ±  2%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.39            +0.0        0.43 ±  2%  perf-profile.children.cycles-pp.tick_nohz_handler
      0.18 ±  4%      +0.0        0.21 ±  3%  perf-profile.children.cycles-pp.schedule
      0.17 ±  3%      +0.0        0.21 ±  2%  perf-profile.children.cycles-pp.schedule_preempt_disabled
      0.18 ±  2%      +0.0        0.22 ±  2%  perf-profile.children.cycles-pp.try_to_wake_up
      0.18 ±  4%      +0.0        0.22 ±  3%  perf-profile.children.cycles-pp.wake_up_q
      0.35            +0.0        0.39 ±  2%  perf-profile.children.cycles-pp.update_process_times
      0.24 ±  3%      +0.0        0.28 ±  3%  perf-profile.children.cycles-pp.sched_tick
      0.25 ±  3%      +0.0        0.29 ±  3%  perf-profile.children.cycles-pp.rwsem_wake
      0.07            +0.0        0.12 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock
      0.28 ±  2%      +0.0        0.33 ±  2%  perf-profile.children.cycles-pp.__schedule
      0.18 ±  3%      +0.1        0.23 ±  5%  perf-profile.children.cycles-pp.task_tick_fair
      0.00            +0.1        0.06        perf-profile.children.cycles-pp.update_se
      0.69 ±  3%      +0.1        0.76 ±  2%  perf-profile.children.cycles-pp.cpuidle_enter
      0.69 ±  3%      +0.1        0.76 ±  2%  perf-profile.children.cycles-pp.cpuidle_enter_state
      0.72 ±  3%      +0.1        0.79 ±  2%  perf-profile.children.cycles-pp.cpuidle_idle_call
      0.36 ±  3%      +0.1        0.44 ±  2%  perf-profile.children.cycles-pp.intel_idle_irq
      0.99 ±  3%      +0.1        1.10 ±  2%  perf-profile.children.cycles-pp.common_startup_64
      0.99 ±  3%      +0.1        1.10 ±  2%  perf-profile.children.cycles-pp.cpu_startup_entry
      0.99 ±  3%      +0.1        1.10 ±  2%  perf-profile.children.cycles-pp.do_idle
      0.97 ±  3%      +0.1        1.08 ±  2%  perf-profile.children.cycles-pp.start_secondary
     94.33            -0.2       94.10        perf-profile.self.cycles-pp.osq_lock
      0.32            -0.0        0.29 ±  2%  perf-profile.self.cycles-pp.rwsem_down_write_slowpath
      0.07 ±  5%      -0.0        0.05        perf-profile.self.cycles-pp.vms_complete_munmap_vmas
      0.05            +0.0        0.06        perf-profile.self.cycles-pp.wake_q_add
      0.06            +0.0        0.07 ±  5%  perf-profile.self.cycles-pp._raw_spin_lock
      0.06 ±  6%      +0.0        0.08 ±  4%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      0.04 ± 44%      +0.0        0.08 ±  6%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
      0.34 ±  4%      +0.1        0.42        perf-profile.self.cycles-pp.intel_idle_irq




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-11 18:24 ` [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing Tim Chen
                     ` (2 preceding siblings ...)
  2025-10-15 15:10   ` Peter Zijlstra
@ 2025-10-24  9:32   ` Aaron Lu
  2025-10-27  2:00     ` Chen, Yu C
  2025-10-27  6:29   ` K Prateek Nayak
  4 siblings, 1 reply; 116+ messages in thread
From: Aaron Lu @ 2025-10-24  9:32 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

Hi Tim,

On Sat, Oct 11, 2025 at 11:24:47AM -0700, Tim Chen wrote:
> @@ -10849,11 +10849,45 @@ static void record_sg_llc_stats(struct lb_env *env,
>  	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
>  		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
>  }
> +
> +/*
> + * Do LLC balance on sched group that contains LLC, and have tasks preferring
> + * to run on LLC in idle dst_cpu.
> + */
> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> +			       struct sched_group *group)
> +{
> +	struct sched_domain *child = env->sd->child;
> +	int llc;
> +
> +	if (!sched_cache_enabled())
> +		return false;
> +
> +	if (env->sd->flags & SD_SHARE_LLC)
> +		return false;
> +
> +	/* only care about task migration among LLCs */
> +	if (child && !(child->flags & SD_SHARE_LLC))
> +		return false;
> +
> +	llc = llc_idx(env->dst_cpu);
> +	if (sgs->nr_pref_llc[llc] > 0 &&
> +	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)

llc_balance() is called from update_sg_lb_stats() and at that time,
env->src_cpu is not determined yet so should not be used here?

> +		return true;
> +
> +	return false;
> +}

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-24  9:32   ` Aaron Lu
@ 2025-10-27  2:00     ` Chen, Yu C
  2025-10-29  9:51       ` Aaron Lu
  0 siblings, 1 reply; 116+ messages in thread
From: Chen, Yu C @ 2025-10-27  2:00 UTC (permalink / raw)
  To: Aaron Lu, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Libo Chen, Adam Li, Tim Chen, linux-kernel

Hi Aaron,

On 10/24/2025 5:32 PM, Aaron Lu wrote:
> Hi Tim,
> 
> On Sat, Oct 11, 2025 at 11:24:47AM -0700, Tim Chen wrote:
>> @@ -10849,11 +10849,45 @@ static void record_sg_llc_stats(struct lb_env *env,
>>   	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
>>   		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
>>   }
>> +
>> +/*
>> + * Do LLC balance on sched group that contains LLC, and have tasks preferring
>> + * to run on LLC in idle dst_cpu.
>> + */
>> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
>> +			       struct sched_group *group)
>> +{
>> +	struct sched_domain *child = env->sd->child;
>> +	int llc;
>> +
>> +	if (!sched_cache_enabled())
>> +		return false;
>> +
>> +	if (env->sd->flags & SD_SHARE_LLC)
>> +		return false;
>> +
>> +	/* only care about task migration among LLCs */
>> +	if (child && !(child->flags & SD_SHARE_LLC))
>> +		return false;
>> +
>> +	llc = llc_idx(env->dst_cpu);
>> +	if (sgs->nr_pref_llc[llc] > 0 &&
>> +	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
> 
> llc_balance() is called from update_sg_lb_stats() and at that time,
> env->src_cpu is not determined yet so should not be used here?
> 

You are right, I think we should check the candidate group's first
CPU rather than the env->src_cpu. Will fix it in the next version.
Thanks a lot!

chenyu
>> +		return true;
>> +
>> +	return false;
>> +}
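
For illustration, the discussed direction could look roughly like the
fragment below, using the first CPU of the candidate group on the source
side (a sketch only, not the actual v2 change):

	llc = llc_idx(env->dst_cpu);
	if (sgs->nr_pref_llc[llc] > 0 &&
	    can_migrate_llc(cpumask_first(sched_group_span(group)),
			    env->dst_cpu, 0, true) == mig_llc)
		return true;

	return false;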

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-11 18:24 ` [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing Tim Chen
  2025-10-14 19:12   ` Madadi Vineeth Reddy
  2025-10-23  7:26   ` kernel test robot
@ 2025-10-27  4:47   ` K Prateek Nayak
  2025-10-27 13:35     ` Chen, Yu C
  2 siblings, 1 reply; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-27  4:47 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

Hello Tim,

On 10/11/2025 11:54 PM, Tim Chen wrote:

[..snip..]

>  static s64 update_se(struct rq *rq, struct sched_entity *se)
>  {
>  	u64 now = rq_clock_task(rq);
> @@ -1174,6 +1176,7 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>  
>  		trace_sched_stat_runtime(running, delta_exec);
>  		account_group_exec_runtime(running, delta_exec);
> +		account_mm_sched(rq, donor, delta_exec);

Shouldn't we attribute this to "rq->curr"/"running" since that is the
task which is actually running on the CPU (with "rq->curr->mm" being the
one that is being used on CPU) as opposed to the "donor" which is just
providing the vruntime context?

>  
>  		/* cgroup time is always accounted against the donor */
>  		cgroup_account_cputime(donor, delta_exec);

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-10-11 18:24 ` [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions Tim Chen
  2025-10-15 10:15   ` Peter Zijlstra
@ 2025-10-27  5:01   ` K Prateek Nayak
  2025-10-27 14:07     ` Chen, Yu C
  1 sibling, 1 reply; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-27  5:01 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

Hello Tim,

On 10/11/2025 11:54 PM, Tim Chen wrote:
> +#ifdef CONFIG_SCHED_CACHE
> +/*
> + * Record the statistics for this scheduler group for later
> + * use. These values guide load balancing on aggregating tasks
> + * to a LLC.
> + */
> +static void record_sg_llc_stats(struct lb_env *env,
> +				struct sg_lb_stats *sgs,
> +				struct sched_group *group)
> +{
> +	/*
> +	 * Find the child domain on env->dst_cpu. This domain
> +	 * is either the domain that spans this group(if the
> +	 * group is a local group), or the sibling domain of
> +	 * this group.
> +	 */
> +	struct sched_domain *sd = env->sd->child;

Was this intentionally done to limit the update to sg_llc_stats to the
load balancing period of "sd_llc->parent"?

Can't this be done in update_idle_cpu_scan()? I believe it is more
frequent; "sds->total_capacity" from the caller gives you the equivalent
of "group_capacity", and "group_util" is already calculated as "sum_util".

Checking "sd_llc->parent" there should be sufficient to check if there
are multiple LLC domains or not. Thoughts?

> +	struct sched_domain_shared *sd_share;
> +
> +	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
> +		return;
> +
> +	/* only care about sched domains spanning a LLC */
> +	if (sd != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
> +		return;
> +
> +	/*
> +	 * At this point we know this group spans a LLC domain.
> +	 * Record the statistic of this group in its corresponding
> +	 * shared LLC domain.
> +	 */
> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
> +					   cpumask_first(sched_group_span(group))));
> +	if (!sd_share)
> +		return;
> +
> +	if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
> +		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
> +
> +	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
> +		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
> +}

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-11 18:24 ` [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs Tim Chen
  2025-10-15 11:04   ` Peter Zijlstra
@ 2025-10-27  5:42   ` K Prateek Nayak
  2025-10-27 12:56     ` Chen, Yu C
  1 sibling, 1 reply; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-27  5:42 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

Hello Tim,

On 10/11/2025 11:54 PM, Tim Chen wrote:
> @@ -2530,10 +2531,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  				 * between LLCs and memory channels.
>  				 */
>  				nr_llcs = sd->span_weight / child->span_weight;
> -				if (nr_llcs == 1)
> +				if (nr_llcs == 1) {
>  					imb = sd->span_weight >> 3;
> -				else
> +				} else {
>  					imb = nr_llcs;
> +					has_multi_llcs = true;

One caution: this will not hold if not all CPUs are online during boot.
One case I can think of is when the kernel is booted with the "maxcpus"
cmdline and CPUs are hotplugged later.

Unfortunately, I don't think we even have the raw topology data from the
arch/ side under such scenario to accurately make a call if the system
contains single or multiple LLC :(

I'm not sure if it is feasible, but assuming the task_work() cannot run if
&sched_cache_allowed is false, can the first instance of the task work for
sched_cache do the necessary setup?

> +				}
>  				imb = max(1U, imb);
>  				sd->imb_numa_nr = imb;
>  
> @@ -2581,6 +2584,13 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  	if (has_cluster)
>  		static_branch_inc_cpuslocked(&sched_cluster_active);
>  
> +#ifdef CONFIG_SCHED_CACHE
> +	if (has_multi_llcs) {
> +		static_branch_enable_cpuslocked(&sched_cache_allowed);
> +		pr_info("Cache aware load balance enabled.\n");
> +	}
> +#endif
> +
>  	if (rq && sched_debug_verbose)
>  		pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map));
>  

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-11 18:24 ` [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue Tim Chen
  2025-10-15 12:05   ` Peter Zijlstra
@ 2025-10-27  6:04   ` K Prateek Nayak
  2025-10-28 15:15     ` Chen, Yu C
  2025-10-28 17:06     ` Tim Chen
  1 sibling, 2 replies; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-27  6:04 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

Hello Tim,

On 10/11/2025 11:54 PM, Tim Chen wrote:
> @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  		struct rq *rq = rq_of(cfs_rq);
>  
>  		account_numa_enqueue(rq, task_of(se));
> +		account_llc_enqueue(rq, task_of(se));
>  		list_add(&se->group_node, &rq->cfs_tasks);
>  	}
>  	cfs_rq->nr_queued++;
> @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	update_load_sub(&cfs_rq->load, se->load.weight);
>  	if (entity_is_task(se)) {
>  		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
>  		list_del_init(&se->group_node);
>  	}
>  	cfs_rq->nr_queued--;
> +
> +	/* safeguard to clear the cache aware data */
> +	if (!parent_entity(se) && !cfs_rq->nr_queued)
> +		reset_llc_stats(rq_of(cfs_rq));

Instead of relying on the reset_llc_stats() hack, I think a better
approach would be to have a "p->se.llc_sched_active" flag, similar to how
uclamp has "uc_se->active": set it in account_llc_enqueue(), which still
checks sched_cache_enabled(), while account_llc_dequeue() only checks
"p->se.llc_sched_active" to decrement the stats and then unsets the flag.

That way, we cannot have imbalanced accounting. Thoughts?
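
Something along these lines, as a sketch (llc_sched_active and
pref_llc_idx() are made-up names here; the point is only that dequeue
keys off the flag set at enqueue instead of re-checking
sched_cache_enabled()):

static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
{
	if (!sched_cache_enabled() || p->preferred_llc == -1)
		return;

	/* however the series maps p->preferred_llc to an array slot */
	rq->nr_pref_llc[pref_llc_idx(p)]++;
	p->se.llc_sched_active = 1;
}

static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
{
	/* undo only what enqueue actually accounted */
	if (!p->se.llc_sched_active)
		return;

	rq->nr_pref_llc[pref_llc_idx(p)]--;
	p->se.llc_sched_active = 0;
}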

>  }
>  

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-11 18:24 ` [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing Tim Chen
                     ` (3 preceding siblings ...)
  2025-10-24  9:32   ` Aaron Lu
@ 2025-10-27  6:29   ` K Prateek Nayak
  2025-10-28 12:11     ` Chen, Yu C
  4 siblings, 1 reply; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-27  6:29 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

Hello Tim,

On 10/11/2025 11:54 PM, Tim Chen wrote:
> +/*
> + * Do LLC balance on sched group that contains LLC, and have tasks preferring
> + * to run on LLC in idle dst_cpu.
> + */
> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> +			       struct sched_group *group)
> +{
> +	struct sched_domain *child = env->sd->child;
> +	int llc;
> +
> +	if (!sched_cache_enabled())
> +		return false;
> +
> +	if (env->sd->flags & SD_SHARE_LLC)
> +		return false;
> +
> +	/* only care about task migration among LLCs */
> +	if (child && !(child->flags & SD_SHARE_LLC))

nit. You can just check group->flags here.

> +		return false;
> +
> +	llc = llc_idx(env->dst_cpu);
> +	if (sgs->nr_pref_llc[llc] > 0 &&
> +	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
> +		return true;
> +
> +	return false;
> +}
>  #else
>  static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
>  				       struct sched_group *group)
>  {
>  }
> +
> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> +			       struct sched_group *group)
> +{
> +	return false;
> +}
>  #endif
>  
>  /**
> @@ -10954,6 +10988,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>  
>  	record_sg_llc_stats(env, sgs, group);

Okay, I see the intention of recording this based on group stats.
Sorry for the noise on Patch 2.

> +
> +	/* Check for tasks in this group can be moved to their preferred LLC */
> +	if (!local_group && llc_balance(env, sgs, group))
> +		sgs->group_llc_balance = 1;
> +
>  	/* Computing avg_load makes sense only when group is overloaded */
>  	if (sgs->group_type == group_overloaded)
>  		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group
  2025-10-11 18:24 ` [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group Tim Chen
  2025-10-15 12:22   ` Peter Zijlstra
  2025-10-15 12:25   ` Peter Zijlstra
@ 2025-10-27  8:33   ` K Prateek Nayak
  2025-10-27 23:19     ` Tim Chen
  2 siblings, 1 reply; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-27  8:33 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

Hello Tim,

On 10/11/2025 11:54 PM, Tim Chen wrote:
> +#ifdef CONFIG_SCHED_CACHE
> +		if (sched_cache_enabled()) {
> +			int j;
> +
> +			for (j = 0; j < max_llcs; ++j)
> +				sgs->nr_pref_llc[j] += rq->nr_pref_llc[j];
> +		}
> +#endif

If I'm not mistaken, we only compare
"sds->nr_pref_llc[llc_idx(env->dst_cpu)]"
and the destination LLC is always fixed. Do we need to aggregate the
data for all the LLCs? Is a single "nr_pref_llc_dest" enough?
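
i.e., something like this (a sketch, using the nr_pref_llc_dest name
suggested above) instead of carrying the whole per-LLC array in
sg_lb_stats:

#ifdef CONFIG_SCHED_CACHE
		if (sched_cache_enabled())
			sgs->nr_pref_llc_dest +=
				rq->nr_pref_llc[llc_idx(env->dst_cpu)];
#endif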

>  		/*
>  		 * No need to call idle_cpu() if nr_running is not 0
>  		 */

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 12/19] sched/fair: Add migrate_llc_task migration type for cache-aware balancing
  2025-10-11 18:24 ` [PATCH 12/19] sched/fair: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
@ 2025-10-27  9:04   ` K Prateek Nayak
  2025-10-27 22:59     ` Tim Chen
  0 siblings, 1 reply; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-27  9:04 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

Hello Tim,

On 10/11/2025 11:54 PM, Tim Chen wrote:
> @@ -12149,6 +12167,16 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>  			}
>  			break;
>  
> +		case migrate_llc_task:
> +#ifdef CONFIG_SCHED_CACHE
> +			dst_llc = llc_idx(env->dst_cpu);
> +			if (!cpus_share_cache(env->dst_cpu, rq->cpu) &&

Busiest group is always a non-local group, right? Can cpus_share_cache()
ever be true given we are looking at groups of the first !SD_SHARE_LLC
domain for "migrate_llc_task"?

> +			    busiest_pref_llc < rq->nr_pref_llc[dst_llc]) {
> +				busiest_pref_llc = rq->nr_pref_llc[dst_llc];
> +				busiest = rq;
> +			}
> +#endif
> +			break;
>  		case migrate_task:
>  			if (busiest_nr < nr_running) {
>  				busiest_nr = nr_running;

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-27  5:42   ` K Prateek Nayak
@ 2025-10-27 12:56     ` Chen, Yu C
  2025-10-27 23:36       ` Tim Chen
  2025-10-28  2:46       ` K Prateek Nayak
  0 siblings, 2 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-27 12:56 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Tim Chen, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

Hi Prateek,

On 10/27/2025 1:42 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/11/2025 11:54 PM, Tim Chen wrote:
>> @@ -2530,10 +2531,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>   				 * between LLCs and memory channels.
>>   				 */
>>   				nr_llcs = sd->span_weight / child->span_weight;
>> -				if (nr_llcs == 1)
>> +				if (nr_llcs == 1) {
>>   					imb = sd->span_weight >> 3;
>> -				else
>> +				} else {
>>   					imb = nr_llcs;
>> +					has_multi_llcs = true;
> 
> One caution: this will not hold if all the CPUs aren't online during boot.
> One case I can think of is when the kernel is booted with "maxcpus" cmdline
> and CPUs are hotplugged later.
> 
> Unfortunately, I don't think we even have the raw topology data from the
> arch/ side under such scenario to accurately make a call if the system
> contains single or multiple LLC :(
> 
> I'm not sure if it is feasible but assuming the task_work() cannot run if
> &sched_cache_allowed is false, can the fist instance of the task work for
> sched_cache do the necessary setup?
> 

build_sched_domains() might get invoked to rebuild the corresponding sched
domains during CPU hotplug via the cpuset subsystem. So if a CPU comes
online after bootup, we still have a chance to detect multiple LLCs, I
suppose?

I did a check on my VM:
root@ubuntu:/sys/devices/system/cpu# lscpu
CPU(s):                      32
   On-line CPU(s) list:       0-7
root@ubuntu:/sys/devices/system/cpu# echo 1 > cpu31/online
Tracing ... Hit Ctrl-C to end.
^C

@build_sched_domains[
     build_sched_domains+5
     partition_sched_domains+613
     cpuset_update_active_cpus+838
     sched_cpu_activate+272
     cpuhp_invoke_callback+340
     cpuhp_thread_fun+139
     smpboot_thread_fn+238
     kthread+249
     ret_from_fork+193
     ret_from_fork_asm+26
]: 1

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing
  2025-10-27  4:47   ` K Prateek Nayak
@ 2025-10-27 13:35     ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-27 13:35 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar, Tim Chen

Hi Prateek,

On 10/27/2025 12:47 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/11/2025 11:54 PM, Tim Chen wrote:
> 
> [..snip..]
> 
>>   static s64 update_se(struct rq *rq, struct sched_entity *se)
>>   {
>>   	u64 now = rq_clock_task(rq);
>> @@ -1174,6 +1176,7 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>>   
>>   		trace_sched_stat_runtime(running, delta_exec);
>>   		account_group_exec_runtime(running, delta_exec);
>> +		account_mm_sched(rq, donor, delta_exec);
> 
> Shouldn't we attribute this to "rq->curr"/"running" since that is the
> task which is actually running on the CPU (with "rq->curr->mm" being the
> one that is being used on CPU) as opposed to the "donor" which is just
> providing the vruntime context?
> 

This is a good point. I'm not quite familiar with proxy execution,
but after studying commit aa4f74dfd42b ("sched: Fix runtime accounting
w/ split exec & sched contexts"), it seems that anything related to raw
running time should be accumulated to rq->curr, the actual proxy task,
while anything related to vruntime, which is tied to the task selection
strategy, should be accumulated to rq->donor, whose context is being
borrowed. Is this a convention, or did we encounter any issues before
aa4f74dfd42b?

I think it makes sense to change to rq->curr. As in "[PATCH
16/19] sched/fair: Exclude processes with many threads from cache-aware
scheduling," we use rq->curr to determine how many active threads the
process has. We should also change rq->donor to rq->curr in this
[PATCH 1/19] to keep them consistent.
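
i.e., roughly, in update_se() (a sketch only):

		trace_sched_stat_runtime(running, delta_exec);
		account_group_exec_runtime(running, delta_exec);
		account_mm_sched(rq, running, delta_exec);	/* was: donor */

		/* cgroup time is always accounted against the donor */
		cgroup_account_cputime(donor, delta_exec);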


thanks,
Chenyu
>>   
>>   		/* cgroup time is always accounted against the donor */
>>   		cgroup_account_cputime(donor, delta_exec);
> 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-10-27  5:01   ` K Prateek Nayak
@ 2025-10-27 14:07     ` Chen, Yu C
  2025-10-28  2:50       ` K Prateek Nayak
  0 siblings, 1 reply; 116+ messages in thread
From: Chen, Yu C @ 2025-10-27 14:07 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Ingo Molnar, Adam Li, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, Gautham R . Shenoy

Hi Prateek,

On 10/27/2025 1:01 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/11/2025 11:54 PM, Tim Chen wrote:
>> +#ifdef CONFIG_SCHED_CACHE
>> +/*
>> + * Record the statistics for this scheduler group for later
>> + * use. These values guide load balancing on aggregating tasks
>> + * to a LLC.
>> + */
>> +static void record_sg_llc_stats(struct lb_env *env,
>> +				struct sg_lb_stats *sgs,
>> +				struct sched_group *group)
>> +{
>> +	/*
>> +	 * Find the child domain on env->dst_cpu. This domain
>> +	 * is either the domain that spans this group(if the
>> +	 * group is a local group), or the sibling domain of
>> +	 * this group.
>> +	 */
>> +	struct sched_domain *sd = env->sd->child;
> 
> Was this intentionally done to limit the update to sg_llc_stats to the
> load balancing period of "sd_llc->parent"?
> 
> Can't this be done with update_idle_cpu_scan()? I believe it is more
> frequent, "sds->total_capacity" from caller gives you the equivalent of
> "group_capacity", and "group_util" is already calculated as "sum_util".
> 
> Checking "sd_llc->parent" there should be sufficient to check if there
> are multiple LLC domains or not. Thoughts?
> 

The original idea was to calculate the statistics for the CPUs within
one LLC and set the tag for that sched group as well as its sg_lb_stats
(but not at the sched domain scope). With this flag set in that sched
group, we can perform some comparisons in update_sd_pick_busiest() to
determine whether that sched group has any tasks that need to be moved to
other LLC sched groups. If we did this in update_idle_cpu_scan(), wouldn't
it be a bit too late for update_sd_pick_busiest()?

thanks,
Chenyu

>> +	struct sched_domain_shared *sd_share;
>> +
>> +	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
>> +		return;
>> +
>> +	/* only care about sched domains spanning a LLC */
>> +	if (sd != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
>> +		return;
>> +
>> +	/*
>> +	 * At this point we know this group spans a LLC domain.
>> +	 * Record the statistic of this group in its corresponding
>> +	 * shared LLC domain.
>> +	 */
>> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
>> +					   cpumask_first(sched_group_span(group))));
>> +	if (!sd_share)
>> +		return;
>> +
>> +	if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
>> +		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
>> +
>> +	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
>> +		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
>> +}
> 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 12/19] sched/fair: Add migrate_llc_task migration type for cache-aware balancing
  2025-10-27  9:04   ` K Prateek Nayak
@ 2025-10-27 22:59     ` Tim Chen
  0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-27 22:59 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Mon, 2025-10-27 at 14:34 +0530, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/11/2025 11:54 PM, Tim Chen wrote:
> > @@ -12149,6 +12167,16 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
> >  			}
> >  			break;
> >  
> > +		case migrate_llc_task:
> > +#ifdef CONFIG_SCHED_CACHE
> > +			dst_llc = llc_idx(env->dst_cpu);
> > +			if (!cpus_share_cache(env->dst_cpu, rq->cpu) &&
> 
> Busiest group is always a non-local group right? Can cpus_share_cache()
> ever happen given we are looking at groups of first !SD_LLC_SHARE
> domain for "migrate_llc_task"?

That's a good point. We should have already checked that busiest
and local are not in the same LLC when we mark migrate_llc_task.
We shouldn't need to do the cpus_share_cache() check here.

Tim


> 
> > +			    busiest_pref_llc < rq->nr_pref_llc[dst_llc]) {
> > +				busiest_pref_llc = rq->nr_pref_llc[dst_llc];
> > +				busiest = rq;
> > +			}
> > +#endif
> > +			break;
> >  		case migrate_task:
> >  			if (busiest_nr < nr_running) {
> >  				busiest_nr = nr_running;

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group
  2025-10-27  8:33   ` K Prateek Nayak
@ 2025-10-27 23:19     ` Tim Chen
  0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-27 23:19 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Mon, 2025-10-27 at 14:03 +0530, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/11/2025 11:54 PM, Tim Chen wrote:
> > +#ifdef CONFIG_SCHED_CACHE
> > +		if (sched_cache_enabled()) {
> > +			int j;
> > +
> > +			for (j = 0; j < max_llcs; ++j)
> > +				sgs->nr_pref_llc[j] += rq->nr_pref_llc[j];
> > +		}
> > +#endif
> 
> If I'm not mistaken, we only compare
> "sds->nr_pref_llc[llc_idx(env->dst_cpu)]"
> and the destination LLC is always fixes. Do we need to aggregate the
> data for all the LLCs? Is a single "nr_pref_llc_dest" enough?

Yes. Only the nr_pref_llc entry corresponding to the destination
LLC is going to be used later to find either the LLC or run queue
to be chosen for balancing.  We can skip accounting for the other LLCs and save
some memory here.

Tim

> 
> >  		/*
> >  		 * No need to call idle_cpu() if nr_running is not 0
> >  		 */

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-27 12:56     ` Chen, Yu C
@ 2025-10-27 23:36       ` Tim Chen
  2025-10-29 12:36         ` Chen, Yu C
  2025-10-28  2:46       ` K Prateek Nayak
  1 sibling, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-27 23:36 UTC (permalink / raw)
  To: Chen, Yu C, K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

On Mon, 2025-10-27 at 20:56 +0800, Chen, Yu C wrote:
> Hi Prateek,
> 
> On 10/27/2025 1:42 PM, K Prateek Nayak wrote:
> > Hello Tim,
> > 
> > On 10/11/2025 11:54 PM, Tim Chen wrote:
> > > @@ -2530,10 +2531,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >   				 * between LLCs and memory channels.
> > >   				 */
> > >   				nr_llcs = sd->span_weight / child->span_weight;
> > > -				if (nr_llcs == 1)
> > > +				if (nr_llcs == 1) {
> > >   					imb = sd->span_weight >> 3;
> > > -				else
> > > +				} else {
> > >   					imb = nr_llcs;
> > > +					has_multi_llcs = true;
> > 
> > One caution: this will not hold if all the CPUs aren't online during boot.
> > One case I can think of is when the kernel is booted with "maxcpus" cmdline
> > and CPUs are hotplugged later.
> > 
> > Unfortunately, I don't think we even have the raw topology data from the
> > arch/ side under such scenario to accurately make a call if the system
> > contains single or multiple LLC :(
> > 
> > I'm not sure if it is feasible but assuming the task_work() cannot run if
> > &sched_cache_allowed is false, can the fist instance of the task work for
> > sched_cache do the necessary setup?
> > 
> 
> build_sched_domains() might get invoked to rebuild the corresponding sched
> domains during CPU hotplug via cpuset subsystem. So if the CPU gets online
> after bootup, we still have the chance to detect multiple LLCs I suppose?

The case Prateek brought up of adding CPUs and enabling SCHED_CACHE
should be covered.

The trickier case is if we disable SCHED_CACHE when CPUs are offlined
and we no longer have multiple LLCs. We would need to clear out the
rq->nr_pref_llc data, and tasks' preferred LLC would need to be cleared
as well. Otherwise the accounting could be skewed when we bring CPUs
online later and re-enable SCHED_CACHE. So far we haven't done that when
we disable SCHED_CACHE from an enabled state.
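
If we ever do that, the teardown might look roughly like this
(hypothetical helper, not in the series; assumes rq->nr_pref_llc stays
sized by max_llcs):

/* Hypothetical cleanup when SCHED_CACHE goes from enabled to disabled. */
static void sched_cache_clear_state(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct rq *rq = cpu_rq(cpu);
		struct task_struct *p;
		struct rq_flags rf;

		rq_lock_irqsave(rq, &rf);
		/* forget the per-rq preference counts */
		memset(rq->nr_pref_llc, 0,
		       max_llcs * sizeof(rq->nr_pref_llc[0]));
		/* and the preferences of the tasks queued here */
		list_for_each_entry(p, &rq->cfs_tasks, se.group_node)
			p->preferred_llc = -1;
		rq_unlock_irqrestore(rq, &rf);
	}
}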

Tim

> 
> I did a check on my VM:
> root@ubuntu:/sys/devices/system/cpu# lscpu
> CPU(s):                      32
>    On-line CPU(s) list:       0-7
> root@ubuntu:/sys/devices/system/cpu# echo 1 > cpu31/online
> Tracing ... Hit Ctrl-C to end.
> ^C
> 
> @build_sched_domains[
>      build_sched_domains+5
>      partition_sched_domains+613
>      cpuset_update_active_cpus+838
>      sched_cpu_activate+272
>      cpuhp_invoke_callback+340
>      cpuhp_thread_fun+139
>      smpboot_thread_fn+238
>      kthread+249
>      ret_from_fork+193
>      ret_from_fork_asm+26
> ]: 1
> 
> thanks,
> Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-27 12:56     ` Chen, Yu C
  2025-10-27 23:36       ` Tim Chen
@ 2025-10-28  2:46       ` K Prateek Nayak
  1 sibling, 0 replies; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-28  2:46 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Tim Chen, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

Hello Chenyu,

On 10/27/2025 6:26 PM, Chen, Yu C wrote:
> build_sched_domains() might get invoked to rebuild the corresponding sched
> domains during CPU hotplug via cpuset subsystem. So if the CPU gets online
> after bootup, we still have the chance to detect multiple LLCs I suppose?

Ah yes! Thank you for confirming.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-10-27 14:07     ` Chen, Yu C
@ 2025-10-28  2:50       ` K Prateek Nayak
  0 siblings, 0 replies; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-28  2:50 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Ingo Molnar, Adam Li, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, Gautham R . Shenoy

Hello Chenyu,

On 10/27/2025 7:37 PM, Chen, Yu C wrote:
> Hi Prateek,
> 
> On 10/27/2025 1:01 PM, K Prateek Nayak wrote:
>> Hello Tim,
>>
>> On 10/11/2025 11:54 PM, Tim Chen wrote:
>>> +#ifdef CONFIG_SCHED_CACHE
>>> +/*
>>> + * Record the statistics for this scheduler group for later
>>> + * use. These values guide load balancing on aggregating tasks
>>> + * to a LLC.
>>> + */
>>> +static void record_sg_llc_stats(struct lb_env *env,
>>> +                struct sg_lb_stats *sgs,
>>> +                struct sched_group *group)
>>> +{
>>> +    /*
>>> +     * Find the child domain on env->dst_cpu. This domain
>>> +     * is either the domain that spans this group(if the
>>> +     * group is a local group), or the sibling domain of
>>> +     * this group.
>>> +     */
>>> +    struct sched_domain *sd = env->sd->child;
>>
>> Was this intentionally done to limit the update to sg_llc_stats to the
>> load balancing period of "sd_llc->parent"?
>>
>> Can't this be done with update_idle_cpu_scan()? I believe it is more
>> frequent, "sds->total_capacity" from caller gives you the equivalent of
>> "group_capacity", and "group_util" is already calculated as "sum_util".
>>
>> Checking "sd_llc->parent" there should be sufficient to check if there
>> are multiple LLC domains or not. Thoughts?
>>
> 
> The original idea was to calculate the statistics for the CPUs within
> one LLC, and set the tag for that sched group as well as its sg_lb_stats
> (but not at the sched domain scope). With this flag set in that sched group,
> we can perform some comparisons in update_sd_pick_busiest() to determine if
> that sched group has any tasks that need to be moved to other LLC sched groups.
> If we do this in update_idle_cpu_scan(), might it be a bit late for
> update_sd_pick_busiest()?

Once I got to Patch 10, the location of record_sg_llc_stats() became
more clear w.r.t. the following call to llc_balance(). Thank you for
clarifying and sorry for the noise.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-11 18:24 ` [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach Tim Chen
@ 2025-10-28  6:02   ` K Prateek Nayak
  2025-10-28 11:58     ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-28  6:02 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

Hello Tim,

On 10/11/2025 11:54 PM, Tim Chen wrote:
> @@ -9969,6 +9969,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (env->flags & LBF_ACTIVE_LB)
>  		return 1;
>  
> +#ifdef CONFIG_SCHED_CACHE
> +	if (sched_cache_enabled() &&
> +	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid)
> +		return 0;
> +#endif
> +
>  	degrades = migrate_degrades_locality(p, env);
>  	if (!degrades)
>  		hot = task_hot(p, env);

Should we care for task_hot() w.r.t. migration cost if a task is being
moved to a preferred LLC?

Also, should we leave out tasks under core scheduling from the llc
aware lb? Even discount them when calculating "mm->nr_running_avg"?

> @@ -10227,6 +10233,20 @@ static int detach_tasks(struct lb_env *env)
>  		if (env->imbalance <= 0)
>  			break;
>  
> +#ifdef CONFIG_SCHED_CACHE
> +		/*
> +		 * Don't detach more tasks if the remaining tasks want
> +		 * to stay. We know the remaining tasks all prefer the
> +		 * current LLC, because after order_tasks_by_llc(), the
> +		 * tasks that prefer the current LLC are at the tail of
> +		 * the list. The inhibition of detachment is to avoid too
> +		 * many tasks being migrated out of the preferred LLC.
> +		 */
> +		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
> +		    llc_id(env->src_cpu) == p->preferred_llc)
> +			break;

In all cases? Should we check can_migrate_llc() w.r.t. the util migrated
and then make a call on whether we should move the preferred-LLC tasks or
not?

Perhaps disallow it the first time if "nr_balance_failed" is 0 but
subsequent failed attempts should perhaps explore breaking the preferred
llc restriction if there is an imbalance and we are under
"mig_unrestricted" conditions.

> +#endif
> +
>  		continue;
>  next:
>  		if (p->sched_task_hot)

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-28  6:02   ` K Prateek Nayak
@ 2025-10-28 11:58     ` Chen, Yu C
  2025-10-28 15:30       ` Tim Chen
  2025-10-29  3:54       ` K Prateek Nayak
  0 siblings, 2 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-28 11:58 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Tim Chen, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

Hi Prateek,

On 10/28/2025 2:02 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/11/2025 11:54 PM, Tim Chen wrote:
>> @@ -9969,6 +9969,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>   	if (env->flags & LBF_ACTIVE_LB)
>>   		return 1;
>>   
>> +#ifdef CONFIG_SCHED_CACHE
>> +	if (sched_cache_enabled() &&
>> +	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid)
>> +		return 0;
>> +#endif
>> +
>>   	degrades = migrate_degrades_locality(p, env);
>>   	if (!degrades)
>>   		hot = task_hot(p, env);
> 
> Should we care for task_hot() w.r.t. migration cost if a task is being
> moved to a preferred LLC?
> 

This is a good question. The decision not to migrate a task when doing so
would violate its LLC preference takes priority over the check in
task_hot().

The main reason is that we want cache-aware aggregation to be more
aggressive than generic migration; otherwise, cache-aware migration
might not take effect, according to our previous tests. This seems to
be a trade-off. Another consideration might be: should we consider
the occupancy of a single thread or that of the entire process?
For example, suppose t0, t1, and t2 belong to the same process. t0
and t1 are running on the process's preferred LLC0, while t2 is
running on the non-preferred LLC1. Even though t2 has high occupancy
on LLC1 (making it cache-hot on LLC1), we might still want to move t2
to LLC0 if t0, t1, and t2 read from and write to each other, since we
don't want to generate cross-LLC accesses.

> Also, should we leave out tasks under core scheduling from the llc
> aware lb? Even discount them when calculating "mm->nr_running_avg"?
> 
Yes, it seems the cookie match check case was missed; it is embedded in
task_hot(). I suppose you are referring to the p->core_cookie check; I'll
look in this direction.
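
e.g., something along these lines in can_migrate_llc_task() (a sketch;
returning mig_unrestricted is just one way to say "no cache-aware
constraint" for such tasks):

#ifdef CONFIG_SCHED_CORE
	/* leave core-scheduled (cookie'd) tasks out of cache-aware aggregation */
	if (sched_core_enabled(cpu_rq(src_cpu)) && p->core_cookie)
		return mig_unrestricted;
#endif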

>> @@ -10227,6 +10233,20 @@ static int detach_tasks(struct lb_env *env)
>>   		if (env->imbalance <= 0)
>>   			break;
>>   
>> +#ifdef CONFIG_SCHED_CACHE
>> +		/*
>> +		 * Don't detach more tasks if the remaining tasks want
>> +		 * to stay. We know the remaining tasks all prefer the
>> +		 * current LLC, because after order_tasks_by_llc(), the
>> +		 * tasks that prefer the current LLC are at the tail of
>> +		 * the list. The inhibition of detachment is to avoid too
>> +		 * many tasks being migrated out of the preferred LLC.
>> +		 */
>> +		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
>> +		    llc_id(env->src_cpu) == p->preferred_llc)
>> +			break;
> 
> In all cases? Should we check can_migrate_llc() wrt to util migrated and
> then make a call if we should move the preferred LLC tasks or not?
> 

Prior to this "stop detaching tasks" check, we performed a
can_migrate_task(p) to determine whether the detached p is being dequeued
from its preferred LLC, and in can_migrate_task() we use
can_migrate_llc_task() -> can_migrate_llc() to carry out that check. That
is to say, only after some tasks have already been detached will we stop
further detaching.

> Perhaps disallow it the first time if "nr_balance_failed" is 0 but
> subsequent failed attempts should perhaps explore breaking the preferred
> llc restriction if there is an imbalance and we are under
> "mig_unrestricted" conditions.
> 

I suppose you are suggesting that the threshold for stopping task
detachment should be higher. With the above can_migrate_llc() check,
haven't we already raised that threshold?

thanks,
Chenyu

>> +#endif
>> +
>>   		continue;
>>   next:
>>   		if (p->sched_task_hot)
> 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-27  6:29   ` K Prateek Nayak
@ 2025-10-28 12:11     ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-28 12:11 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Tim Chen, Ingo Molnar,
	Gautham R . Shenoy, Peter Zijlstra

On 10/27/2025 2:29 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/11/2025 11:54 PM, Tim Chen wrote:
>> +/*
>> + * Do LLC balance on sched group that contains LLC, and have tasks preferring
>> + * to run on LLC in idle dst_cpu.
>> + */
>> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
>> +			       struct sched_group *group)
>> +{
>> +	struct sched_domain *child = env->sd->child;
>> +	int llc;
>> +
>> +	if (!sched_cache_enabled())
>> +		return false;
>> +
>> +	if (env->sd->flags & SD_SHARE_LLC)
>> +		return false;
>> +
>> +	/* only care about task migration among LLCs */
>> +	if (child && !(child->flags & SD_SHARE_LLC))
> 
> nit. You can just check group->flags here.
> 

Got it, we will simplify the code.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-27  6:04   ` K Prateek Nayak
@ 2025-10-28 15:15     ` Chen, Yu C
  2025-10-28 15:46       ` Tim Chen
  2025-10-29  4:00       ` K Prateek Nayak
  2025-10-28 17:06     ` Tim Chen
  1 sibling, 2 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-28 15:15 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Tim Chen, Peter Zijlstra,
	Ingo Molnar, Gautham R . Shenoy

On 10/27/2025 2:04 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/11/2025 11:54 PM, Tim Chen wrote:
>> @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>   		struct rq *rq = rq_of(cfs_rq);
>>   
>>   		account_numa_enqueue(rq, task_of(se));
>> +		account_llc_enqueue(rq, task_of(se));
>>   		list_add(&se->group_node, &rq->cfs_tasks);
>>   	}
>>   	cfs_rq->nr_queued++;
>> @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>   	update_load_sub(&cfs_rq->load, se->load.weight);
>>   	if (entity_is_task(se)) {
>>   		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
>> +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
>>   		list_del_init(&se->group_node);
>>   	}
>>   	cfs_rq->nr_queued--;
>> +
>> +	/* safeguard to clear the cache aware data */
>> +	if (!parent_entity(se) && !cfs_rq->nr_queued)
>> +		reset_llc_stats(rq_of(cfs_rq));
> 
> Instead of relying on reset_llc_stats() hack, I think a better approach
> would be to have a "p->se.llc_sched_active" flag similar to how uclamp
> has "uc_se->active" and we set this in account_llc_enqueue() which will
> still check for sched_cache_enabled() but account_llc_dequeue() would
> only check for "p->se.llc_sched_active" to decrement the stats and then
> unset the flag.
> 
> That way, we cannot have an imbalanced accounting. Thoughts?
> 

I suppose what you mean is to avoid the race condition between
enabling sched_cache and EQ/DE_LLC, similar to uclamp:

         enqueue(taskA)
         // sched_cache gets enabled
         enqueue(taskB)
         dequeue(taskA)
         // Must not decrement rq->llc_pref for taskA
         dequeue(taskB)

We'll think more about this.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-28 11:58     ` Chen, Yu C
@ 2025-10-28 15:30       ` Tim Chen
  2025-10-29  4:15         ` K Prateek Nayak
  2025-10-29  3:54       ` K Prateek Nayak
  1 sibling, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-28 15:30 UTC (permalink / raw)
  To: Chen, Yu C, K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

On Tue, 2025-10-28 at 19:58 +0800, Chen, Yu C wrote:
> Hi Prateek,
> 
> On 10/28/2025 2:02 PM, K Prateek Nayak wrote:
> > Hello Tim,
> > 
> > On 10/11/2025 11:54 PM, Tim Chen wrote:
> > > @@ -9969,6 +9969,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> > >   	if (env->flags & LBF_ACTIVE_LB)
> > >   		return 1;
> > >   
> > > +#ifdef CONFIG_SCHED_CACHE
> > > +	if (sched_cache_enabled() &&
> > > +	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid)
> > > +		return 0;
> > > +#endif
> > > +
> > >   	degrades = migrate_degrades_locality(p, env);
> > >   	if (!degrades)
> > >   		hot = task_hot(p, env);
> > 
> > Should we care for task_hot() w.r.t. migration cost if a task is being
> > moved to a preferred LLC?
> > 
> 
> This is a good question. The decision not to migrate a task when its
> LLC preference is violated takes priority over the check in task_hot().
> 
> The main reason is that we want cache aware aggregation to be more
> aggressive than generic migration; otherwise, cache-aware migration
>   might not take effect according to our previous test. This seems to
> be a trade-off. Another consideration might be: should we consider
> the occupancy of a single thread or that of the entire process?
> For example, suppose t0, t1, and t2 belong to the same process. t0
> and t1 are running on the process's preferred LLC0, while t2 is
> running on the non-preferred LLC1. Even though t2 has high occupancy
> on LLC1 (making it cache-hot on LLC1), we might still want to move t2
> to LLC0 if t0, t1, and t2 read from and write to each other - since we 
> don't want to generate cross-LLC access.
> 
> > Also, should we leave out tasks under core scheduling from the llc
> > aware lb? Even discount them when calculating "mm->nr_running_avg"?
> > 
> Yes, it seems that the cookie match check case was missed, which is
> embedded in task_hot(). I suppose you are referring to the p->core_cookie
> check; I'll look into this direction.
> 
> > > @@ -10227,6 +10233,20 @@ static int detach_tasks(struct lb_env *env)
> > >   		if (env->imbalance <= 0)
> > >   			break;
> > >   
> > > +#ifdef CONFIG_SCHED_CACHE
> > > +		/*
> > > +		 * Don't detach more tasks if the remaining tasks want
> > > +		 * to stay. We know the remaining tasks all prefer the
> > > +		 * current LLC, because after order_tasks_by_llc(), the
> > > +		 * tasks that prefer the current LLC are at the tail of
> > > +		 * the list. The inhibition of detachment is to avoid too
> > > +		 * many tasks being migrated out of the preferred LLC.
> > > +		 */
> > > +		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
> > > +		    llc_id(env->src_cpu) == p->preferred_llc)
> > > +			break;
> > 
> > In all cases? Should we check can_migrate_llc() wrt to util migrated and
> > then make a call if we should move the preferred LLC tasks or not?
> > 
> 
> Prior to this "stop of detaching tasks", we performed a can_migrate_task(p)
> to determine if the detached p is dequeued from its preferred LLC, and in
> can_migrate_task(), we use can_migrate_llc_task() -> can_migrate_llc() to
> carry out the check. That is to say, only when certain tasks have been
> detached, will we stop further detaching.
> 
> > Perhaps disallow it the first time if "nr_balance_failed" is 0 but
> > subsequent failed attempts should perhaps explore breaking the preferred
> > llc restriction if there is an imbalance and we are under
> > "mig_unrestricted" conditions.
> > 
> 

Pratek,

We have to actually allow for imbalance between LLCs with task
aggregation.

Say we have 2 LLCs and only one process running. Suppose all tasks in the process
can fit in one LLC and not overload it. Then we should not pull tasks from
the preferred LLC, and allow the imbalance. If we balance the tasks the
second time around, that will defeat the purpose.

That's why we have the knob llc_overload_pct (50%), which will start
spreading tasks to the non-preferred LLC once the load in the preferred
LLC exceeds 50%, and llc_imb_pct (20%), which allows for a 20% higher load
between the preferred LLC and a non-preferred LLC if the preferred LLC is
operating above 50%.
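
In pseudo-C, the intent is roughly this (hypothetical variable and
helper names, defaults as above):

	/* below the overload threshold: keep aggregating, tolerate the imbalance */
	if (util_pref_llc * 100 <= cap_pref_llc * llc_overload_pct)
		keep_aggregating();
	/* above it: still aggregate while within the allowed imbalance */
	else if (util_pref_llc * 100 <= util_other_llc * (100 + llc_imb_pct))
		keep_aggregating();
	else
		spread_to_other_llcs();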

So if we ignore the LLC policy totally the second time around, we may be breaking
LLC aggregation and have tasks be moved to their non-preferred LLC.

We'll take a closer look to see whether nr_balance_failed > 0 happens
because we repeatedly cannot move tasks to their preferred LLC, and
whether we should do anything different to balance tasks better without
violating the LLC preference.

Tim

> I suppose you are suggesting that the threshold for stopping task 
> detachment
> should be higher. With the above can_migrate_llc() check, I suppose we have
> raised the threshold for stopping "task detachment"?
> 
> thanks,
> Chenyu
> 
> > > +#endif
> > > +
> > >   		continue;
> > >   next:
> > >   		if (p->sched_task_hot)
> > 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-28 15:15     ` Chen, Yu C
@ 2025-10-28 15:46       ` Tim Chen
  2025-10-29  4:32         ` K Prateek Nayak
  2025-10-29  4:00       ` K Prateek Nayak
  1 sibling, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-28 15:46 UTC (permalink / raw)
  To: Chen, Yu C, K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Gautham R . Shenoy

On Tue, 2025-10-28 at 23:15 +0800, Chen, Yu C wrote:
> On 10/27/2025 2:04 PM, K Prateek Nayak wrote:
> > Hello Tim,
> > 
> > On 10/11/2025 11:54 PM, Tim Chen wrote:
> > > @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >   		struct rq *rq = rq_of(cfs_rq);
> > >   
> > >   		account_numa_enqueue(rq, task_of(se));
> > > +		account_llc_enqueue(rq, task_of(se));
> > >   		list_add(&se->group_node, &rq->cfs_tasks);
> > >   	}
> > >   	cfs_rq->nr_queued++;
> > > @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > >   	update_load_sub(&cfs_rq->load, se->load.weight);
> > >   	if (entity_is_task(se)) {
> > >   		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> > > +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
> > >   		list_del_init(&se->group_node);
> > >   	}
> > >   	cfs_rq->nr_queued--;
> > > +
> > > +	/* safeguard to clear the cache aware data */
> > > +	if (!parent_entity(se) && !cfs_rq->nr_queued)
> > > +		reset_llc_stats(rq_of(cfs_rq));
> > 
> > Instead of relying on reset_llc_stats() hack, I think a better approach
> > would be to have a "p->se.llc_sched_active" flag similar to how uclamp
> > has "uc_se->active" and we set this in account_llc_enqueue() which will
> > still check for sched_cache_enabled() but account_llc_dequeue() would
> > only check for "p->se.llc_sched_active" to decrement the stats and then
> > unset the flag.
> > 
> > That way, we cannot have an imbalanced accounting. Thoughts?
> > 
> 
> I suppose what you mean is to avoid the race condition between
> enabling sched_cache and EQ/DE_LLC, similar to uclamp:
> 
>          enqueue(taskA)
>          // sched_cache gets enabled
>          enqueue(taskB)
>          dequeue(taskA)
>          // Must not decrement rq->llc_pref for taskA

For this case, task A is already on the rq when sched_cache gets
enabled, but task A's preferred_llc is still -1.

If we dequeue it while its preferred_llc is still -1, it won't
affect rq->llc_pref.

If we change its preferred_llc to llc_i before we dequeue it,
then rq->llc_pref[llc_i] will be incremented first.

Then when we dequeue task A, we will decrement it. We are
still accounting rq->llc_pref[llc_i] correctly with current
code.

The trickier case is if we need to dynamically resize rq->llc_pref[].
We need to make sure that we lock the rq to prevent enqueue/dequeue,
allocate a larger rq->llc_pref[], copy the old data over, switch to the
larger array, and then unlock the rq to keep the accounting straight.
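
Roughly (a sketch; this assumes rq->nr_pref_llc is a kmalloc'ed array,
with old_n/new_n being the old and new max_llcs):

static int resize_nr_pref_llc(struct rq *rq, int old_n, int new_n)
{
	int *new_arr, *old_arr;
	struct rq_flags rf;

	new_arr = kcalloc(new_n, sizeof(*new_arr), GFP_KERNEL);
	if (!new_arr)
		return -ENOMEM;

	/* block enqueue/dequeue while the arrays are swapped */
	rq_lock_irqsave(rq, &rf);
	old_arr = rq->nr_pref_llc;
	memcpy(new_arr, old_arr, old_n * sizeof(*new_arr));
	rq->nr_pref_llc = new_arr;
	rq_unlock_irqrestore(rq, &rf);

	kfree(old_arr);
	return 0;
}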

Tim 

>          dequeue(taskB)
> 
> We'll think more about this.




> 
> thanks,
> Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-27  6:04   ` K Prateek Nayak
  2025-10-28 15:15     ` Chen, Yu C
@ 2025-10-28 17:06     ` Tim Chen
  1 sibling, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-10-28 17:06 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Mon, 2025-10-27 at 11:34 +0530, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/11/2025 11:54 PM, Tim Chen wrote:
> > @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  		struct rq *rq = rq_of(cfs_rq);
> >  
> >  		account_numa_enqueue(rq, task_of(se));
> > +		account_llc_enqueue(rq, task_of(se));
> >  		list_add(&se->group_node, &rq->cfs_tasks);
> >  	}
> >  	cfs_rq->nr_queued++;
> > @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  	update_load_sub(&cfs_rq->load, se->load.weight);
> >  	if (entity_is_task(se)) {
> >  		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> > +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
> >  		list_del_init(&se->group_node);
> >  	}
> >  	cfs_rq->nr_queued--;
> > +
> > +	/* safeguard to clear the cache aware data */
> > +	if (!parent_entity(se) && !cfs_rq->nr_queued)
> > +		reset_llc_stats(rq_of(cfs_rq));
> 
> Instead of relying on reset_llc_stats() hack, I think a better approach
> would be to have a "p->se.llc_sched_active" flag similar to how uclamp
> has "uc_se->active" and we set this in account_llc_enqueue() which will
> still check for sched_cache_enabled() but account_llc_dequeue() would
> only check for "p->se.llc_sched_active" to decrement the stats and then
> unset the flag.
> 
> That way, we cannot have an imbalanced accounting. Thoughts?

With our current accounting method, we should not have imbalanced
accounting even if sched_cache is turned on after the scheduler has
started running (see my reply to Chen Yu's follow-up). That
reset_llc_stats() hack should not be needed, as Peter pointed out.

We will change that check to a warning under a debug option.

Tim

> 
> >  }
> >  

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-28 11:58     ` Chen, Yu C
  2025-10-28 15:30       ` Tim Chen
@ 2025-10-29  3:54       ` K Prateek Nayak
  2025-10-29 14:23         ` Chen, Yu C
  2025-10-29 21:09         ` Tim Chen
  1 sibling, 2 replies; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-29  3:54 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Tim Chen, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

Hello Chenyu,

On 10/28/2025 5:28 PM, Chen, Yu C wrote:
> Hi Prateek,
> 
> On 10/28/2025 2:02 PM, K Prateek Nayak wrote:
>> Hello Tim,
>>
>> On 10/11/2025 11:54 PM, Tim Chen wrote:
>>> @@ -9969,6 +9969,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>>       if (env->flags & LBF_ACTIVE_LB)
>>>           return 1;
>>>   +#ifdef CONFIG_SCHED_CACHE
>>> +    if (sched_cache_enabled() &&
>>> +        can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid)
>>> +        return 0;
>>> +#endif
>>> +
>>>       degrades = migrate_degrades_locality(p, env);
>>>       if (!degrades)
>>>           hot = task_hot(p, env);
>>
>> Should we care for task_hot() w.r.t. migration cost if a task is being
>> moved to a preferred LLC?
>>
> 
> This is a good question. The decision not to migrate a task when its
> LLC preference is violated takes priority over the check in task_hot().
> 
> The main reason is that we want cache aware aggregation to be more
> aggressive than generic migration; otherwise, cache-aware migration
>  might not take effect according to our previous test. This seems to
> be a trade-off. Another consideration might be: should we consider
> the occupancy of a single thread or that of the entire process?
> For example, suppose t0, t1, and t2 belong to the same process. t0
> and t1 are running on the process's preferred LLC0, while t2 is
> running on the non-preferred LLC1. Even though t2 has high occupancy
> on LLC1 (making it cache-hot on LLC1), we might still want to move t2
> to LLC0 if t0, t1, and t2 read from and write to each other - since we don't want to generate cross-LLC access.

Makes sense. That would need some heuristics based on the avg_running
to know which LLC can be a potential target with the fewest migrations.
But then again, in a dynamic system things change so quickly - what
you have now seems to be a good start to further optimize on top of.

> 
>> Also, should we leave out tasks under core scheduling from the llc
>> aware lb? Even discount them when calculating "mm->nr_running_avg"?
>>
> Yes, it seems that the cookie match check case was missed, which is
> embedded in task_hot(). I suppose you are referring to the p->core_cookie
> check; I'll look into this direction.

Yup! I think if the user has opted into core scheduling, they should
ideally not need to bother with cache aware scheduling.
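
For illustration, a minimal sketch of what leaving core-scheduled
tasks out of the LLC-aware path could look like; sched_core_enabled()
and p->core_cookie are the existing core scheduling hooks, while the
helper name and its call sites are assumptions:

	/* sketch: skip cache-aware aggregation for core-scheduled tasks */
	static inline bool llc_aggr_applies(struct rq *rq, struct task_struct *p)
	{
		if (!sched_cache_enabled())
			return false;

		/* leave placement of cookie'd tasks to core scheduling */
		if (sched_core_enabled(rq) && p->core_cookie)
			return false;

		return true;
	}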

> 
>>> @@ -10227,6 +10233,20 @@ static int detach_tasks(struct lb_env *env)
>>>           if (env->imbalance <= 0)
>>>               break;
>>>   +#ifdef CONFIG_SCHED_CACHE
>>> +        /*
>>> +         * Don't detach more tasks if the remaining tasks want
>>> +         * to stay. We know the remaining tasks all prefer the
>>> +         * current LLC, because after order_tasks_by_llc(), the
>>> +         * tasks that prefer the current LLC are at the tail of
>>> +         * the list. The inhibition of detachment is to avoid too
>>> +         * many tasks being migrated out of the preferred LLC.
>>> +         */
>>> +        if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
>>> +            llc_id(env->src_cpu) == p->preferred_llc)
>>> +            break;
>>
>> In all cases? Should we check can_migrate_llc() wrt to util migrated and
>> then make a call if we should move the preferred LLC tasks or not?
>>
> 
> Prior to this "stop of detaching tasks", we performed a can_migrate_task(p)
> to determine if the detached p is dequeued from its preferred LLC, and in
> can_migrate_task(), we use can_migrate_llc_task() -> can_migrate_llc() to
> carry out the check. That is to say, only when certain tasks have been
> detached, will we stop further detaching.
> 
>> Perhaps disallow it the first time if "nr_balance_failed" is 0 but
>> subsequent failed attempts should perhaps explore breaking the preferred
>> llc restriction if there is an imbalance and we are under
>> "mig_unrestricted" conditions.
>>
> 
> I suppose you are suggesting that the threshold for stopping task detachment
> should be higher. With the above can_migrate_llc() check, I suppose we have
> raised the threshold for stopping "task detachment"?

Say the LLC is under heavy load and we only have overloaded groups.
can_migrate_llc() would return "mig_unrestricted" since
fits_llc_capacity() would return false.

Since we are under "migrate_load", sched_balance_find_src_rq() has
returned the CPU with the highest load, which could very well be the
CPU with a large number of preferred LLC tasks.

sched_cache_enabled() is still true and when detach_tasks() reaches
one of these preferred llc tasks (which comes at the very end of the
tasks list), we break out even if env->imbalance > 0, leaving a
potential imbalance for the "migrate_load" case.

Instead, we can account for the util moved out of the src_llc and
after accounting for it, check if can_migrate_llc() would return
"mig_forbid" for the src llc.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-28 15:15     ` Chen, Yu C
  2025-10-28 15:46       ` Tim Chen
@ 2025-10-29  4:00       ` K Prateek Nayak
  1 sibling, 0 replies; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-29  4:00 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Tim Chen, Peter Zijlstra,
	Ingo Molnar, Gautham R . Shenoy

Hello Chenyu,

On 10/28/2025 8:45 PM, Chen, Yu C wrote:
>> Instead of relying on reset_llc_stats() hack, I think a better approach
>> would be to have a "p->se.llc_sched_active" flag similar to how uclamp
>> has "uc_se->active" and we set this in account_llc_enqueue() which will
>> still check for sched_cache_enabled() but account_llc_dequeue() would
>> only check for "p->se.llc_sched_active" to decrement the stats and then
>> unset the flag.
>>
>> That way, we cannot have an imbalanced accounting. Thoughts?
>>
> 
> I suppose what you mean is to avoid the race condition between
> enabling sched_cache and EQ/DE_LLC, similar to uclamp:
> 
>         enqueue(taskA)
>         // sched_cache gets enabled
>         enqueue(taskB)
>         dequeue(taskA)
>         // Must not decrement rq->llc_pref for taskA
>         dequeue(taskB)

Yup! We can have

  enqueue(p)
    account_llc_enqueue(p)
      if (sched_cache_enabled())
        p->se.llc_sched_active = true;
        rq->nr_llc_running += (p->preferred_llc != -1);
        rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));

    ...

  dequeue(p)
    account_llc_dequeue(p)
      if (p->se.llc_sched_active)
        rq->nr_llc_running -= (p->preferred_llc != -1);
        rq->nr_pref_llc_running -= (p->preferred_llc == task_llc(p));


We can also have a single bit for "llc_sched_active" in the
task_struct next to the "sched_task_hot" bit instead of using the
hole in sched_entity after "sched_delayed".
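
Spelled out as C, the flag-based accounting above might look roughly
like below; a sketch only, with the counter and flag names following
this thread rather than the posted patch:

	static inline void account_llc_enqueue(struct rq *rq, struct task_struct *p)
	{
		if (!sched_cache_enabled())
			return;

		/* remember that this task was counted, so dequeue can undo it */
		p->se.llc_sched_active = 1;
		rq->nr_llc_running += (p->preferred_llc != -1);
		rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));
	}

	static inline void account_llc_dequeue(struct rq *rq, struct task_struct *p)
	{
		/* only undo accounting that was actually done at enqueue time */
		if (!p->se.llc_sched_active)
			return;

		p->se.llc_sched_active = 0;
		rq->nr_llc_running -= (p->preferred_llc != -1);
		rq->nr_pref_llc_running -= (p->preferred_llc == task_llc(p));
	}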

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-28 15:30       ` Tim Chen
@ 2025-10-29  4:15         ` K Prateek Nayak
  0 siblings, 0 replies; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-29  4:15 UTC (permalink / raw)
  To: Tim Chen, Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

Hello Tim,

On 10/28/2025 9:00 PM, Tim Chen wrote:
>>>> +#ifdef CONFIG_SCHED_CACHE
>>>> +		/*
>>>> +		 * Don't detach more tasks if the remaining tasks want
>>>> +		 * to stay. We know the remaining tasks all prefer the
>>>> +		 * current LLC, because after order_tasks_by_llc(), the
>>>> +		 * tasks that prefer the current LLC are at the tail of
>>>> +		 * the list. The inhibition of detachment is to avoid too
>>>> +		 * many tasks being migrated out of the preferred LLC.
>>>> +		 */
>>>> +		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
>>>> +		    llc_id(env->src_cpu) == p->preferred_llc)
>>>> +			break;
>>>
>>> In all cases? Should we check can_migrate_llc() wrt to util migrated and
>>> then make a call if we should move the preferred LLC tasks or not?
>>>
>>
>> Prior to this "stop of detaching tasks", we performed a can_migrate_task(p)
>> to determine if the detached p is dequeued from its preferred LLC, and in
>> can_migrate_task(), we use can_migrate_llc_task() -> can_migrate_llc() to
>> carry out the check. That is to say, only when certain tasks have been
>> detached, will we stop further detaching.
>>
>>> Perhaps disallow it the first time if "nr_balance_failed" is 0 but
>>> subsequent failed attempts should perhaps explore breaking the preferred
>>> llc restriction if there is an imbalance and we are under
>>> "mig_unrestricted" conditions.
>>>
>>
> 
> Pratek,
> 
> We have to actually allow for imbalance between LLCs with task
> aggregation.
> 
> Say we have 2 LLCs and only one process running. Suppose all tasks in the process
> can fit in one LLC and not overload it. Then we should not pull tasks from
> the preferred LLC, and allow the imbalance. If we balance the tasks the
> second time around, that will defeat the purpose.
> 
> That's why we have the knob llc_overload_pct (50%), which will start spreading
> tasks to non-preferred LLC once load in preferred LLC excees 50%.
> And llc_imb_pct(20%), which allows for a 20% higher load between preferred LLC
> and non-preferred LLC if the preferred LLC is operating above 50%.
> 
> So if we ignore the LLC policy totally the second time around, we may be breaking
> LLC aggregation and have tasks be moved to their non-preferred LLC.

Ack! I have replied to Chenyu's response with an example of
"migrate_load" case that, as per my understanding, would be restricted
by this condition. If I'm missing something, please do let me know.
Otherwise, the intention looks good to me.

> 
> Will take a closer look to see if nr_balance_failed > 0
> because we cannot move tasks to their preferred LLC repeatedly, and if
> we should do anything different to balance tasks better without violating
> LLC preference.

Thank you!

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-28 15:46       ` Tim Chen
@ 2025-10-29  4:32         ` K Prateek Nayak
  2025-10-29 12:48           ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-29  4:32 UTC (permalink / raw)
  To: Tim Chen, Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Gautham R . Shenoy

Hello Tim,

On 10/28/2025 9:16 PM, Tim Chen wrote:
> On Tue, 2025-10-28 at 23:15 +0800, Chen, Yu C wrote:
>> On 10/27/2025 2:04 PM, K Prateek Nayak wrote:
>>> Hello Tim,
>>>
>>> On 10/11/2025 11:54 PM, Tim Chen wrote:
>>>> @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>>   		struct rq *rq = rq_of(cfs_rq);
>>>>   
>>>>   		account_numa_enqueue(rq, task_of(se));
>>>> +		account_llc_enqueue(rq, task_of(se));
>>>>   		list_add(&se->group_node, &rq->cfs_tasks);
>>>>   	}
>>>>   	cfs_rq->nr_queued++;
>>>> @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>>   	update_load_sub(&cfs_rq->load, se->load.weight);
>>>>   	if (entity_is_task(se)) {
>>>>   		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
>>>> +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
>>>>   		list_del_init(&se->group_node);
>>>>   	}
>>>>   	cfs_rq->nr_queued--;
>>>> +
>>>> +	/* safeguard to clear the cache aware data */
>>>> +	if (!parent_entity(se) && !cfs_rq->nr_queued)
>>>> +		reset_llc_stats(rq_of(cfs_rq));
>>>
>>> Instead of relying on reset_llc_stats() hack, I think a better approach
>>> would be to have a "p->se.llc_sched_active" flag similar to how uclamp
>>> has "uc_se->active" and we set this in account_llc_enqueue() which will
>>> still check for sched_cache_enabled() but account_llc_dequeue() would
>>> only check for "p->se.llc_sched_active" to decrement the stats and then
>>> unset the flag.
>>>
>>> That way, we cannot have an imbalanced accounting. Thoughts?
>>>
>>
>> I suppose what you mean is to avoid the race condition between
>> enabling sched_cache and EQ/DE_LLC, similar to uclamp:
>>
>>          enqueue(taskA)
>>          // sched_cache gets enabled
>>          enqueue(taskB)
>>          dequeue(taskA)
>>          // Must not decrement rq->llc_pref for taskA
> 
> For this case, task A is already on rq when sched cache get
> enabled. But task A's preferred_llc is still -1. 
> 
> If we dequeue it while its preferred_llc is still -1, it won't
> affect rq->llc_pref.
> 
> If we change its preferred_llc to llc_i before we dequeue it,
> then rq->llc_pref[llc_i] will be incremented first.
> 
> Then when we dequeue task A, we will decrement it. We are
> still accounting rq->llc_pref[llc_i] correctly with current
> code.

So what I really disliked was having reset_llc_stats() to
reset the stats, but looking at it again, that too is guarded
by the sched_cache_enabled() check, so I think the counters can
still go out of balance if:

    /* Cache aware scheduling enabled */
    enqueue(TaskA) /* nr_llc_running = 1 */
    enqueue(TaskB) /* nr_llc_running = 2 */
    enqueue(TaskC) /* nr_llc_running = 3 */
    dequeue(TaskA) /* nr_llc_running = 2 */

    /* Cache aware scheduling disabled */

   dequeue(TaskB) /* nr_llc_running = 2 */
   dequeue(TaskC) /* nr_llc_running = 2 */

   /* nr_running == 0; nr_llc_running = 2 */
   /* Cache aware scheduling enabled again */

   enqueue(TaskD) /* nr_llc_running = 3 */
   enqueue(TaskE) /* nr_llc_running = 4 */

   ...

At some later point, if nr_running reaches 0 again,
"nr_llc_running" is finally reset to 0, but until then it
can show an inaccurate value if users repeatedly toggle the
feature depending on the workload running.

> 
> The trickier case is if we need to dynamically resize
> rq->llc_pref[]. We need to make sure that we lock the rq
> to prevent enqueue/dequeue, switch it to a larger size
> rq->llc_pref[], copy the old data over, then switch over
> to the larger sized rq->llc_pref[] and unlock rq to keep
> the accounting straight.

When that happens, we'll have to modify sched domains since
something has changed in the system / cpuset, and rq_attach_root()
could be a good place to do it, when the rq is offlined and onlined
once again with the rq_lock held.

The only issue is if the partition splits the LLC, in which case
we'll have two LLC indices - one for the first half and another for
the second half - and we'll have to re-account the tasks to the new
index.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 19/19] sched/fair: Add user control to adjust the tolerance of cache-aware scheduling
  2025-10-11 18:24 ` [PATCH 19/19] sched/fair: Add user control to adjust the tolerance of cache-aware scheduling Tim Chen
@ 2025-10-29  8:07   ` Aaron Lu
  2025-10-29 12:54     ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: Aaron Lu @ 2025-10-29  8:07 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Chen Yu, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Sat, Oct 11, 2025 at 11:24:56AM -0700, Tim Chen wrote:
... ...
> +static inline int get_sched_cache_cap_scale(void)
> +{
> +	return (llc_overload_pct / cpu_smt_num_threads);
> +}
> +
... ...
> @@ -9749,7 +9811,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
>   * (default: ~50%)
>   */
>  #define fits_llc_capacity(util, max)	\
> -	((util) * 100 < (max) * llc_overload_pct)
> +	((util) * 100 < (max) * get_sched_cache_cap_scale())
>

With this change, fits_llc_capacity() would be false if util is just 1/4
of max (which is the LLC's total CPU capacity). Is this intended?
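
For reference, a small standalone sketch of the arithmetic behind that
observation, assuming the default llc_overload_pct of 50 and 2 SMT
threads per core (both values are assumptions for illustration):

	#include <stdio.h>

	#define LLC_OVERLOAD_PCT	50	/* assumed default */
	#define CPU_SMT_NUM_THREADS	2	/* assumed topology */

	/* mirrors get_sched_cache_cap_scale(): 50 / 2 = 25 */
	#define CAP_SCALE	(LLC_OVERLOAD_PCT / CPU_SMT_NUM_THREADS)

	/* mirrors the modified fits_llc_capacity() */
	#define fits_llc_capacity(util, max)	((util) * 100 < (max) * CAP_SCALE)

	int main(void)
	{
		unsigned long max = 1024;	/* LLC's total cpu capacity */

		/* cutoff is max * 25 / 100 = max / 4 = 256 */
		printf("%d\n", fits_llc_capacity(250, max));	/* 1: fits    */
		printf("%d\n", fits_llc_capacity(260, max));	/* 0: too big */
		return 0;
	}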

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-27  2:00     ` Chen, Yu C
@ 2025-10-29  9:51       ` Aaron Lu
  2025-10-29 13:19         ` Chen, Yu C
  0 siblings, 1 reply; 116+ messages in thread
From: Aaron Lu @ 2025-10-29  9:51 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Libo Chen, Adam Li, Tim Chen, linux-kernel

On Mon, Oct 27, 2025 at 10:00:52AM +0800, Chen, Yu C wrote:
> Hi Aaron,
>
> On 10/24/2025 5:32 PM, Aaron Lu wrote:
> > Hi Tim,
> >
> > On Sat, Oct 11, 2025 at 11:24:47AM -0700, Tim Chen wrote:
> > > @@ -10849,11 +10849,45 @@ static void record_sg_llc_stats(struct lb_env *env,
> > >   	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
> > >   		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
> > >   }
> > > +
> > > +/*
> > > + * Do LLC balance on sched group that contains LLC, and have tasks preferring
> > > + * to run on LLC in idle dst_cpu.
> > > + */
> > > +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
> > > +			       struct sched_group *group)
> > > +{
> > > +	struct sched_domain *child = env->sd->child;
> > > +	int llc;
> > > +
> > > +	if (!sched_cache_enabled())
> > > +		return false;
> > > +
> > > +	if (env->sd->flags & SD_SHARE_LLC)
> > > +		return false;
> > > +
> > > +	/* only care about task migration among LLCs */
> > > +	if (child && !(child->flags & SD_SHARE_LLC))
> > > +		return false;
> > > +
> > > +	llc = llc_idx(env->dst_cpu);
> > > +	if (sgs->nr_pref_llc[llc] > 0 &&
> > > +	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
> >
> > llc_balance() is called from update_sg_lb_stats() and at that time,
> > env->src_cpu is not determined yet so should not be used here?
> >
>
> You are right, I think we should check the candidate group's first
> CPU rather than the env->src_cpu. Will fix it in the next version.

Looks like can_migrate_llc() doesn't care about an exact cpu - any cpu
in the same LLC should do, so either the candidate group's first cpu or
any other cpu in that group should make no difference.

It might be more intuitive to prototype can_migrate_llc() with sd_shared
as a parameter rather than a cpu, just a thought.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs
  2025-10-27 23:36       ` Tim Chen
@ 2025-10-29 12:36         ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-29 12:36 UTC (permalink / raw)
  To: Tim Chen
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar, K Prateek Nayak

On 10/28/2025 7:36 AM, Tim Chen wrote:
> On Mon, 2025-10-27 at 20:56 +0800, Chen, Yu C wrote:
>> Hi Prateek,
>>
>> On 10/27/2025 1:42 PM, K Prateek Nayak wrote:
>>> Hello Tim,
>>>
>>> On 10/11/2025 11:54 PM, Tim Chen wrote:
>>>> @@ -2530,10 +2531,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>>>    				 * between LLCs and memory channels.
>>>>    				 */
>>>>    				nr_llcs = sd->span_weight / child->span_weight;
>>>> -				if (nr_llcs == 1)
>>>> +				if (nr_llcs == 1) {
>>>>    					imb = sd->span_weight >> 3;
>>>> -				else
>>>> +				} else {
>>>>    					imb = nr_llcs;
>>>> +					has_multi_llcs = true;
>>>
>>> One caution: this will not hold if all the CPUs aren't online during boot.
>>> One case I can think of is when the kernel is booted with "maxcpus" cmdline
>>> and CPUs are hotplugged later.
>>>
>>> Unfortunately, I don't think we even have the raw topology data from the
>>> arch/ side under such scenario to accurately make a call if the system
>>> contains single or multiple LLC :(
>>>
>>> I'm not sure if it is feasible but assuming the task_work() cannot run if
>>> &sched_cache_allowed is false, can the fist instance of the task work for
>>> sched_cache do the necessary setup?
>>>
>>
>> build_sched_domains() might get invoked to rebuild the corresponding sched
>> domains during CPU hotplug via cpuset subsystem. So if the CPU gets online
>> after bootup, we still have the chance to detect multiple LLCs I suppose?
> 
> The case Pratek brought up of adding CPUs and enabling SCHED_CACHE
> should be covered.
> 
> The trickier case is if we disable SCHED_CACHE when CPUs are
> offlined and multi_cpus becomes false.  We'll need to clear out rq->nr_pref_llcs
> data and tasks' preferred LLC would need to be cleared.  Or else the accounting
> could be skewed we bring CPU online later and again re-enable SCHED_CACHE.

To safely keep the data consistent, do we need to add a hook when
SCHED_CACHE is disabled at runtime to clear all the stale data? That
way, another race condition between SCHED_CACHE enabling vs EQ/DQ
might also be covered.

thanks,
Chenyu
> So far we haven't done that when we disable SCHED_CACHE from an enabled state.
> 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue
  2025-10-29  4:32         ` K Prateek Nayak
@ 2025-10-29 12:48           ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-29 12:48 UTC (permalink / raw)
  To: K Prateek Nayak, Tim Chen
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra, Ingo Molnar,
	Gautham R . Shenoy

On 10/29/2025 12:32 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/28/2025 9:16 PM, Tim Chen wrote:
>> On Tue, 2025-10-28 at 23:15 +0800, Chen, Yu C wrote:
>>> On 10/27/2025 2:04 PM, K Prateek Nayak wrote:
>>>> Hello Tim,
>>>>
>>>> On 10/11/2025 11:54 PM, Tim Chen wrote:
>>>>> @@ -3999,6 +4038,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>>>    		struct rq *rq = rq_of(cfs_rq);
>>>>>    
>>>>>    		account_numa_enqueue(rq, task_of(se));
>>>>> +		account_llc_enqueue(rq, task_of(se));
>>>>>    		list_add(&se->group_node, &rq->cfs_tasks);
>>>>>    	}
>>>>>    	cfs_rq->nr_queued++;
>>>>> @@ -4010,9 +4050,14 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>>>    	update_load_sub(&cfs_rq->load, se->load.weight);
>>>>>    	if (entity_is_task(se)) {
>>>>>    		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
>>>>> +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
>>>>>    		list_del_init(&se->group_node);
>>>>>    	}
>>>>>    	cfs_rq->nr_queued--;
>>>>> +
>>>>> +	/* safeguard to clear the cache aware data */
>>>>> +	if (!parent_entity(se) && !cfs_rq->nr_queued)
>>>>> +		reset_llc_stats(rq_of(cfs_rq));
>>>>
>>>> Instead of relying on reset_llc_stats() hack, I think a better approach
>>>> would be to have a "p->se.llc_sched_active" flag similar to how uclamp
>>>> has "uc_se->active" and we set this in account_llc_enqueue() which will
>>>> still check for sched_cache_enabled() but account_llc_dequeue() would
>>>> only check for "p->se.llc_sched_active" to decrement the stats and then
>>>> unset the flag.
>>>>
>>>> That way, we cannot have an imbalanced accounting. Thoughts?
>>>>
>>>
>>> I suppose what you mean is to avoid the race condition between
>>> enabling sched_cache and EQ/DE_LLC, similar to uclamp:
>>>
>>>           enqueue(taskA)
>>>           // sched_cache gets enabled
>>>           enqueue(taskB)
>>>           dequeue(taskA)
>>>           // Must not decrement rq->llc_pref for taskA
>>
>> For this case, task A is already on rq when sched cache get
>> enabled. But task A's preferred_llc is still -1.
>>
>> If we dequeue it while its preferred_llc is still -1, it won't
>> affect rq->llc_pref.
>>
>> If we change its preferred_llc to llc_i before we dequeue it,
>> then rq->llc_pref[llc_i] will be incremented first.
>>
>> Then when we dequeue task A, we will decrement it. We are
>> still accounting rq->llc_pref[llc_i] correctly with current
>> code.
> 
> So what I really disliked was having reset_llc_stats() to
> reset the stat but looking at it again, that too is guarded
> by sched_cache_enabled() counter so I think the counters can
> still go out of balance if:
> 
>      /* Cache aware scheduling enabled */
>      enqueue(TaskA) /* nr_llc_running = 1 */
>      enqueue(TaskB) /* nr_llc_running = 2 */
>      enqueue(TaskC) /* nr_llc_running = 3 */
>      dequeue(TaskA) /* nr_llc_running = 2 */
> 
>      /* Cache aware scheduling disabled */
> 
>     dequeue(TaskB) /* nr_llc_running = 2 */

If we introduce the mechanism you suggested previously ("enable
p->llc_sched_active in account_llc_enqueue(), which will still check
sched_cache_enabled(), but account_llc_dequeue() only checks
p->llc_sched_active to decrement the stats"), then the above scenario
might be covered: dequeue(TaskB) will decrease nr_llc_running even if
cache aware scheduling is disabled. Another idea is to reset all
per-CPU statistics when cache aware scheduling is disabled at runtime;
this might also avoid several race conditions, for example CPU hotplug
vs cache aware scheduling.
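
A minimal sketch of such a hook, assuming the per-rq counter names
used earlier in this thread (whether per-task preferred_llc state also
needs clearing is a separate question):

	/* sketch: clear per-rq cache-aware stats when the feature is turned off */
	static void sched_cache_reset_stats(void)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			struct rq *rq = cpu_rq(cpu);
			struct rq_flags rf;

			/* serialize against enqueue/dequeue accounting */
			rq_lock_irqsave(rq, &rf);
			rq->nr_llc_running = 0;
			rq->nr_pref_llc_running = 0;
			rq_unlock_irqrestore(rq, &rf);
		}
	}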

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 19/19] sched/fair: Add user control to adjust the tolerance of cache-aware scheduling
  2025-10-29  8:07   ` Aaron Lu
@ 2025-10-29 12:54     ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-29 12:54 UTC (permalink / raw)
  To: Aaron Lu, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel

On 10/29/2025 4:07 PM, Aaron Lu wrote:
> On Sat, Oct 11, 2025 at 11:24:56AM -0700, Tim Chen wrote:
> ... ...
>> +static inline int get_sched_cache_cap_scale(void)
>> +{
>> +	return (llc_overload_pct / cpu_smt_num_threads);
>> +}
>> +
> ... ...
>> @@ -9749,7 +9811,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
>>    * (default: ~50%)
>>    */
>>   #define fits_llc_capacity(util, max)	\
>> -	((util) * 100 < (max) * llc_overload_pct)
>> +	((util) * 100 < (max) * get_sched_cache_cap_scale())
>>
> 
> With this change, fits_llc_capacity() would be false if util is just 1/4
> of max (which is the LLC's total CPU capacity). Is this intended?

Yes, it was changed to this because we want to avoid performance
regressions on systems with a large number of SMT threads per core;
aggressive task aggregation is harmful to those systems. However, upon
further thought, since we have a user-space knob to control how
aggressively users want to enable task aggregation, we can try
removing cpu_smt_num_threads and let users decide. I'll do some tests
to check the impact.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing
  2025-10-29  9:51       ` Aaron Lu
@ 2025-10-29 13:19         ` Chen, Yu C
  0 siblings, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-29 13:19 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Yangyu Chen, Tingyin Duan, Vern Hao, Len Brown, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Tim Chen, linux-kernel

On 10/29/2025 5:51 PM, Aaron Lu wrote:
> On Mon, Oct 27, 2025 at 10:00:52AM +0800, Chen, Yu C wrote:
>> Hi Aaron,
>>
>> On 10/24/2025 5:32 PM, Aaron Lu wrote:
>>> Hi Tim,
>>>
>>> On Sat, Oct 11, 2025 at 11:24:47AM -0700, Tim Chen wrote:
>>>> @@ -10849,11 +10849,45 @@ static void record_sg_llc_stats(struct lb_env *env,
>>>>    	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
>>>>    		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
>>>>    }
>>>> +
>>>> +/*
>>>> + * Do LLC balance on sched group that contains LLC, and have tasks preferring
>>>> + * to run on LLC in idle dst_cpu.
>>>> + */
>>>> +static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
>>>> +			       struct sched_group *group)
>>>> +{
>>>> +	struct sched_domain *child = env->sd->child;
>>>> +	int llc;
>>>> +
>>>> +	if (!sched_cache_enabled())
>>>> +		return false;
>>>> +
>>>> +	if (env->sd->flags & SD_SHARE_LLC)
>>>> +		return false;
>>>> +
>>>> +	/* only care about task migration among LLCs */
>>>> +	if (child && !(child->flags & SD_SHARE_LLC))
>>>> +		return false;
>>>> +
>>>> +	llc = llc_idx(env->dst_cpu);
>>>> +	if (sgs->nr_pref_llc[llc] > 0 &&
>>>> +	    can_migrate_llc(env->src_cpu, env->dst_cpu, 0, true) == mig_llc)
>>>
>>> llc_balance() is called from update_sg_lb_stats() and at that time,
>>> env->src_cpu is not determined yet so should not be used here?
>>>
>>
>> You are right, I think we should check the candidate group's first
>> CPU rather than the env->src_cpu. Will fix it in the next version.
> 
> Looks like can_migrate_llc() doesn't care an exact cpu but any cpu in the
> same LLC should do, so either the candidate group's first cpu or any
> other cpus in that group should make no difference.
> 

Yes, cache aware scheduling is actually based on the LLC rather than
the CPU. It is for historical reasons that the original proposal was
based on CPUs in the wakeup path.

> It might be more intuitive to prototype can_migrate_llc() with sd_shared
> as param than using cpu, just a thought.

I understand your concern; I will think about this.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-29  3:54       ` K Prateek Nayak
@ 2025-10-29 14:23         ` Chen, Yu C
  2025-10-29 21:09         ` Tim Chen
  1 sibling, 0 replies; 116+ messages in thread
From: Chen, Yu C @ 2025-10-29 14:23 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Tim Chen, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

On 10/29/2025 11:54 AM, K Prateek Nayak wrote:

[snip]

>>>> @@ -10227,6 +10233,20 @@ static int detach_tasks(struct lb_env *env)
>>>>            if (env->imbalance <= 0)
>>>>                break;
>>>>    +#ifdef CONFIG_SCHED_CACHE
>>>> +        /*
>>>> +         * Don't detach more tasks if the remaining tasks want
>>>> +         * to stay. We know the remaining tasks all prefer the
>>>> +         * current LLC, because after order_tasks_by_llc(), the
>>>> +         * tasks that prefer the current LLC are at the tail of
>>>> +         * the list. The inhibition of detachment is to avoid too
>>>> +         * many tasks being migrated out of the preferred LLC.
>>>> +         */
>>>> +        if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
>>>> +            llc_id(env->src_cpu) == p->preferred_llc)
>>>> +            break;
>>>
>>> In all cases? Should we check can_migrate_llc() wrt to util migrated and
>>> then make a call if we should move the preferred LLC tasks or not?
>>>
>>
>> Prior to this "stop of detaching tasks", we performed a can_migrate_task(p)
>> to determine if the detached p is dequeued from its preferred LLC, and in
>> can_migrate_task(), we use can_migrate_llc_task() -> can_migrate_llc() to
>> carry out the check. That is to say, only when certain tasks have been
>> detached, will we stop further detaching.
>>
>>> Perhaps disallow it the first time if "nr_balance_failed" is 0 but
>>> subsequent failed attempts should perhaps explore breaking the preferred
>>> llc restriction if there is an imbalance and we are under
>>> "mig_unrestricted" conditions.
>>>
>>
>> I suppose you are suggesting that the threshold for stopping task detachment
>> should be higher. With the above can_migrate_llc() check, I suppose we have
>> raised the threshold for stopping "task detachment"?
> 
> Say the LLC is under heavy load and we only have overloaded groups.
> can_migrate_llc() would return "mig_unrestricted" since
> fits_llc_capacity() would return false.
> 
> Since we are under "migrate_load", sched_balance_find_src_rq() has
> returned the CPU with the highest load which could very well be the
> CPU with with a large number of preferred LLC tasks.
> 
> sched_cache_enabled() is still true and when detach_tasks() reaches
> one of these preferred llc tasks (which comes at the very end of the
> tasks list), we break out even if env->imbalance > 0 leaving
> potential imbalance for the "migrate_load" case.
> 
> Instead, we can account for the util moved out of the src_llc and
> after accounting for it, check if can_migrate_llc() would return
> "mig_forbid" for the src llc.
> 

I see your point. The original decision matrix intends to
spread the tasks when both LLCs are overloaded
(src is the preferred LLC, dst is the non-preferred LLC):

src \ dst      30%  40%  50%  60%
30%            N    N    N    N
40%            N    N    N    N
50%            N    N    G    G
60%            Y    N    G    G

  src :      src_util
  dst :      dst_util
  Y :        Yes, migrate
  N :        No, do not migrate
  G :        let the generic load balancer even the load.

I suppose the reason the code breaks the rule here is, as Tim
mentioned in another thread, to inhibit tasks bouncing
between LLCs.
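
For what it's worth, here is a small standalone sketch that encodes the
matrix above; the thresholds come straight from the table, while the
helper name and the handling of points between the sampled columns are
assumptions:

	#include <stdio.h>

	enum llc_decision { MIG_NO, MIG_YES, MIG_GENERIC };

	/* src/dst are LLC utilization in percent; src is the preferred LLC */
	static enum llc_decision llc_migrate_decision(int src_pct, int dst_pct)
	{
		if (src_pct >= 50 && dst_pct >= 50)
			return MIG_GENERIC;	/* both busy: generic LB evens the load */
		if (src_pct >= 60 && dst_pct <= 30)
			return MIG_YES;		/* preferred LLC overloaded, dst nearly idle */
		return MIG_NO;			/* otherwise keep tasks in the preferred LLC */
	}

	int main(void)
	{
		printf("%d\n", llc_migrate_decision(60, 30));	/* MIG_YES (1)     */
		printf("%d\n", llc_migrate_decision(55, 55));	/* MIG_GENERIC (2) */
		printf("%d\n", llc_migrate_decision(40, 30));	/* MIG_NO (0)      */
		return 0;
	}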

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-29  3:54       ` K Prateek Nayak
  2025-10-29 14:23         ` Chen, Yu C
@ 2025-10-29 21:09         ` Tim Chen
  2025-10-30  4:19           ` K Prateek Nayak
  1 sibling, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-29 21:09 UTC (permalink / raw)
  To: K Prateek Nayak, Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

On Wed, 2025-10-29 at 09:24 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 10/28/2025 5:28 PM, Chen, Yu C wrote:
> > Hi Prateek,
> > 
> > On 10/28/2025 2:02 PM, K Prateek Nayak wrote:
> > > Hello Tim,
> > > 
> > > On 10/11/2025 11:54 PM, Tim Chen wrote:
> > > > @@ -9969,6 +9969,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> > > >       if (env->flags & LBF_ACTIVE_LB)
> > > >           return 1;
> > > >   +#ifdef CONFIG_SCHED_CACHE
> > > > +    if (sched_cache_enabled() &&
> > > > +        can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid)
> > > > +        return 0;
> > > > +#endif
> > > > +
> > > >       degrades = migrate_degrades_locality(p, env);
> > > >       if (!degrades)
> > > >           hot = task_hot(p, env);
> > > 
> > > Should we care for task_hot() w.r.t. migration cost if a task is being
> > > moved to a preferred LLC?
> > > 
> > 
> > This is a good question. The decision not to migrate a task when its
> > LLC preference is violated takes priority over the check in task_hot().
> > 
> > The main reason is that we want cache aware aggregation to be more
> > aggressive than generic migration; otherwise, cache-aware migration
> >  might not take effect according to our previous test. This seems to
> > be a trade-off. Another consideration might be: should we consider
> > the occupancy of a single thread or that of the entire process?
> > For example, suppose t0, t1, and t2 belong to the same process. t0
> > and t1 are running on the process's preferred LLC0, while t2 is
> > running on the non-preferred LLC1. Even though t2 has high occupancy
> > on LLC1 (making it cache-hot on LLC1), we might still want to move t2
> > to LLC0 if t0, t1, and t2 read from and write to each other - since we don't want to generate cross-LLC access.
> 
> Makes sense. That would need some heuristics based on the avg_running
> to know which LLC can be be a potential target with fewest migrations.
> But then again, in a dynamic system things change so quickly - what
> you have now seems to be a good start to further optimize on top of.
> 
> > 
> > > Also, should we leave out tasks under core scheduling from the llc
> > > aware lb? Even discount them when calculating "mm->nr_running_avg"?
> > > 
> > Yes, it seems that the cookie match check case was missed, which is
> > embedded in task_hot(). I suppose you are referring to the p->core_cookie
> > check; I'll look into this direction.
> 
> Yup! I think if user has opted into core scheduling, they should ideally
> not bother about cache aware scheduling.
> 
> > 
> > > > @@ -10227,6 +10233,20 @@ static int detach_tasks(struct lb_env *env)
> > > >           if (env->imbalance <= 0)
> > > >               break;
> > > >   +#ifdef CONFIG_SCHED_CACHE
> > > > +        /*
> > > > +         * Don't detach more tasks if the remaining tasks want
> > > > +         * to stay. We know the remaining tasks all prefer the
> > > > +         * current LLC, because after order_tasks_by_llc(), the
> > > > +         * tasks that prefer the current LLC are at the tail of
> > > > +         * the list. The inhibition of detachment is to avoid too
> > > > +         * many tasks being migrated out of the preferred LLC.
> > > > +         */
> > > > +        if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
> > > > +            llc_id(env->src_cpu) == p->preferred_llc)
> > > > +            break;
> > > 
> > > In all cases? 
> > > 

Not in all cases, but only when we know that the remaining tasks prefer to
stay in the current LLC and not be moved to an LLC they don't prefer.

I think we need to add the check that
llc_id(env->dst_cpu) != p->preferred_llc in the above condition.

> > > Should we check can_migrate_llc() wrt to util migrated and
> > > then make a call if we should move the preferred LLC tasks or not?
> > > 
> > 
> > Prior to this "stop of detaching tasks", we performed a can_migrate_task(p)
> > to determine if the detached p is dequeued from its preferred LLC, and in
> > can_migrate_task(), we use can_migrate_llc_task() -> can_migrate_llc() to
> > carry out the check. That is to say, only when certain tasks have been
> > detached, will we stop further detaching.
> > 
> > > Perhaps disallow it the first time if "nr_balance_failed" is 0 but
> > > subsequent failed attempts should perhaps explore breaking the preferred
> > > llc restriction if there is an imbalance and we are under
> > > "mig_unrestricted" conditions.
> > > 
> > 
> > I suppose you are suggesting that the threshold for stopping task detachment
> > should be higher. With the above can_migrate_llc() check, I suppose we have
> > raised the threshold for stopping "task detachment"?
> 
> Say the LLC is under heavy load and we only have overloaded groups.
> can_migrate_llc() would return "mig_unrestricted" since
> fits_llc_capacity() would return false.
> 
> Since we are under "migrate_load", sched_balance_find_src_rq() has
> returned the CPU with the highest load which could very well be the
> CPU with with a large number of preferred LLC tasks.
> 
> sched_cache_enabled() is still true and when detach_tasks() reaches
> one of these preferred llc tasks (which comes at the very end of the
> tasks list), 
> we break out even if env->imbalance > 0 leaving

Yes, but at least one task has been removed to even the load (making
forward progress), and the remaining tasks all wish to stay in the
current LLC and would prefer not to be moved. My thought was not to
even out all the load in one shot by pulling more tasks out of their
preferred LLC. If the imbalance still remains, we'll come to that in
the next load balance.

Pulling tasks out more slowly when we come to tasks that prefer to stay
(if possible) would also help to prevent tasks bouncing between LLCs.

Tim

> potential imbalance for the "migrate_load" case.
> 
> Instead, we can account for the util moved out of the src_llc and
> after accounting for it, check if can_migrate_llc() would return
> "mig_forbid" for the src llc.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-29 21:09         ` Tim Chen
@ 2025-10-30  4:19           ` K Prateek Nayak
  2025-10-30 20:07             ` Tim Chen
  0 siblings, 1 reply; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-30  4:19 UTC (permalink / raw)
  To: Tim Chen, Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

Hello Tim,

On 10/30/2025 2:39 AM, Tim Chen wrote:
>>> I suppose you are suggesting that the threshold for stopping task detachment
>>> should be higher. With the above can_migrate_llc() check, I suppose we have
>>> raised the threshold for stopping "task detachment"?
>>
>> Say the LLC is under heavy load and we only have overloaded groups.
>> can_migrate_llc() would return "mig_unrestricted" since
>> fits_llc_capacity() would return false.
>>
>> Since we are under "migrate_load", sched_balance_find_src_rq() has
>> returned the CPU with the highest load which could very well be the
>> CPU with with a large number of preferred LLC tasks.
>>
>> sched_cache_enabled() is still true and when detach_tasks() reaches
>> one of these preferred llc tasks (which comes at the very end of the
>> tasks list), 
>> we break out even if env->imbalance > 0 leaving
> 
> Yes, but at least one task has been removed to even the load (making forward progress) and
> the remaining tasks all wish to stay in the current LLC and will
> preferred not to be moved. My thought was to not even all the load out
> in one shot and pull more tasks out of their preferred LLC.
> If the imbalance still remain, we'll come to that in the next load balance.

In that case, can we spoof a LBF_ALL_PINNED for the case where we start
hitting preferred task. That way, the main lb loop will goto redo and
try to find another busy CPU to pull tasks from.

> 
> Pulling tasks more slowly when we come to tasks that preferred to stay (if possible)
> would also help to prevent tasks bouncing between LLC.
> 
> Tim
> 

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-30  4:19           ` K Prateek Nayak
@ 2025-10-30 20:07             ` Tim Chen
  2025-10-31  3:32               ` K Prateek Nayak
  0 siblings, 1 reply; 116+ messages in thread
From: Tim Chen @ 2025-10-30 20:07 UTC (permalink / raw)
  To: K Prateek Nayak, Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

On Thu, 2025-10-30 at 09:49 +0530, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/30/2025 2:39 AM, Tim Chen wrote:
> > > > I suppose you are suggesting that the threshold for stopping task detachment
> > > > should be higher. With the above can_migrate_llc() check, I suppose we have
> > > > raised the threshold for stopping "task detachment"?
> > > 
> > > Say the LLC is under heavy load and we only have overloaded groups.
> > > can_migrate_llc() would return "mig_unrestricted" since
> > > fits_llc_capacity() would return false.
> > > 
> > > Since we are under "migrate_load", sched_balance_find_src_rq() has
> > > returned the CPU with the highest load which could very well be the
> > > CPU with with a large number of preferred LLC tasks.
> > > 
> > > sched_cache_enabled() is still true and when detach_tasks() reaches
> > > one of these preferred llc tasks (which comes at the very end of the
> > > tasks list), 
> > > we break out even if env->imbalance > 0 leaving
> > 
> > Yes, but at least one task has been removed to even the load (making forward progress) and
> > the remaining tasks all wish to stay in the current LLC and will
> > preferred not to be moved. My thought was to not even all the load out
> > in one shot and pull more tasks out of their preferred LLC.
> > If the imbalance still remain, we'll come to that in the next load balance.
> 
> In that case, can we spoof a LBF_ALL_PINNED for the case where we start

In the code chunk (with the fix I mentioned in my last reply):

+#ifdef CONFIG_SCHED_CACHE
+		/*
+		 * Don't detach more tasks if the remaining tasks want
+		 * to stay. We know the remaining tasks all prefer the
+		 * current LLC, because after order_tasks_by_llc(), the
+		 * tasks that prefer the current LLC are at the tail of
+		 * the list. The inhibition of detachment is to avoid too
+		 * many tasks being migrated out of the preferred LLC.
+		 */
+		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
+		    llc_id(env->src_cpu) == p->preferred_llc &&
		    llc_id(env->dst_cpu) != p->preferred_llc)
+			break;

We have already pulled at least one task when we stop detaching, because we
know that all the remaining tasks want to stay in their current LLC.
"detached" is non-zero when we break, so LBF_ALL_PINNED would be cleared.
We will only exit the detach_tasks() loop with LBF_ALL_PINNED set when
there are truly no tasks that can be moved, i.e. a genuine all-pinned case.

We should not be causing a problem with LBF_ALL_PINNED.

Tim
> hitting preferred task. That way, the main lb loop will goto redo and
> try to find another busy CPU to pull tasks from.
> 
> > 
> > Pulling tasks more slowly when we come to tasks that preferred to stay (if possible)
> > would also help to prevent tasks bouncing between LLC.
> > 
> > Tim
> > 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-30 20:07             ` Tim Chen
@ 2025-10-31  3:32               ` K Prateek Nayak
  2025-10-31 15:17                 ` Chen, Yu C
  2025-11-03 22:07                 ` Tim Chen
  0 siblings, 2 replies; 116+ messages in thread
From: K Prateek Nayak @ 2025-10-31  3:32 UTC (permalink / raw)
  To: Tim Chen, Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

Hello Tim,

On 10/31/2025 1:37 AM, Tim Chen wrote:
> On Thu, 2025-10-30 at 09:49 +0530, K Prateek Nayak wrote:
>> Hello Tim,
>>
>> On 10/30/2025 2:39 AM, Tim Chen wrote:
>>>>> I suppose you are suggesting that the threshold for stopping task detachment
>>>>> should be higher. With the above can_migrate_llc() check, I suppose we have
>>>>> raised the threshold for stopping "task detachment"?
>>>>
>>>> Say the LLC is under heavy load and we only have overloaded groups.
>>>> can_migrate_llc() would return "mig_unrestricted" since
>>>> fits_llc_capacity() would return false.
>>>>
>>>> Since we are under "migrate_load", sched_balance_find_src_rq() has
>>>> returned the CPU with the highest load which could very well be the
>>>> CPU with with a large number of preferred LLC tasks.
>>>>
>>>> sched_cache_enabled() is still true and when detach_tasks() reaches
>>>> one of these preferred llc tasks (which comes at the very end of the
>>>> tasks list), 
>>>> we break out even if env->imbalance > 0 leaving
>>>
>>> Yes, but at least one task has been removed to even the load (making forward progress) and
>>> the remaining tasks all wish to stay in the current LLC and will
>>> preferred not to be moved. My thought was to not even all the load out
>>> in one shot and pull more tasks out of their preferred LLC.
>>> If the imbalance still remain, we'll come to that in the next load balance.
>>
>> In that case, can we spoof a LBF_ALL_PINNED for the case where we start
> 
> In the code chunk (with fix I mentioned in last reply):
> 
> +#ifdef CONFIG_SCHED_CACHE
> +		/*
> +		 * Don't detach more tasks if the remaining tasks want
> +		 * to stay. We know the remaining tasks all prefer the
> +		 * current LLC, because after order_tasks_by_llc(), the
> +		 * tasks that prefer the current LLC are at the tail of
> +		 * the list. The inhibition of detachment is to avoid too
> +		 * many tasks being migrated out of the preferred LLC.
> +		 */
> +		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
> +		    llc_id(env->src_cpu) == p->preferred_llc &&
> 		    llc_id(env->dst_cpu) != p->preferred_llc)
> +			break;
> 
> We have already pulled at least one task when we stop detaching because we
> know that all the remaining tasks want to stay in it current LLC.
> "detached" is non zero when we break. So LBF_ALL_PINNED would be cleared.
> We will only exit the detach_tasks loop when there are truly no tasks
> that can be moved and it is truly a LBF_ALL_PINNED case.

So what I was suggesting is something like:

@@ -10251,6 +10252,7 @@ static int detach_tasks(struct lb_env *env)
 	unsigned long util, load;
 	struct task_struct *p;
 	int detached = 0;
+	bool preserve_preferred;
 
 	lockdep_assert_rq_held(env->src_rq);
 
@@ -10268,6 +10270,10 @@ static int detach_tasks(struct lb_env *env)
 
 	tasks = order_tasks_by_llc(env, &env->src_rq->cfs_tasks);
 
+	preserve_preferred = sched_cache_enabled() &&
+			     !(env->sd->flags & SD_SHARE_LLC) &&
+			     !env->sd->nr_balance_failed;
+
 	while (!list_empty(tasks)) {
 		/*
 		 * We don't want to steal all, otherwise we may be treated likewise,
@@ -10370,16 +10376,15 @@ static int detach_tasks(struct lb_env *env)
 
 #ifdef CONFIG_SCHED_CACHE
 		/*
-		 * Don't detach more tasks if the remaining tasks want
-		 * to stay. We know the remaining tasks all prefer the
-		 * current LLC, because after order_tasks_by_llc(), the
-		 * tasks that prefer the current LLC are at the tail of
-		 * the list. The inhibition of detachment is to avoid too
-		 * many tasks being migrated out of the preferred LLC.
+		 * We've hit tasks that prefer src LLC while balancing between LLCs.
+		 * If previous balances have been successful, pretend the rest of the
+		 * tasks on this CPU are pinned and let the main load balancing loop
+		 * find another target CPU to pull from if imbalance exists.
 		 */
-		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
-		    llc_id(env->src_cpu) == p->preferred_llc)
+		if (preserve_preferred && detached && llc_id(env->src_cpu) == p->preferred_llc) {
+			env->flags |= LBF_ALL_PINNED;
 			break;
+		}
 #endif
 

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-31  3:32               ` K Prateek Nayak
@ 2025-10-31 15:17                 ` Chen, Yu C
  2025-11-03 21:41                   ` Tim Chen
  2025-11-03 22:07                 ` Tim Chen
  1 sibling, 1 reply; 116+ messages in thread
From: Chen, Yu C @ 2025-10-31 15:17 UTC (permalink / raw)
  To: K Prateek Nayak, Tim Chen
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

Hi Prateek,

On 10/31/2025 11:32 AM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/31/2025 1:37 AM, Tim Chen wrote:
>> On Thu, 2025-10-30 at 09:49 +0530, K Prateek Nayak wrote:
>>> Hello Tim,
>>>
>>> On 10/30/2025 2:39 AM, Tim Chen wrote:
>>>>>> I suppose you are suggesting that the threshold for stopping task detachment
>>>>>> should be higher. With the above can_migrate_llc() check, I suppose we have
>>>>>> raised the threshold for stopping "task detachment"?
>>>>>
>>>>> Say the LLC is under heavy load and we only have overloaded groups.
>>>>> can_migrate_llc() would return "mig_unrestricted" since
>>>>> fits_llc_capacity() would return false.
>>>>>
>>>>> Since we are under "migrate_load", sched_balance_find_src_rq() has
>>>>> returned the CPU with the highest load which could very well be the
>>>>> CPU with with a large number of preferred LLC tasks.
>>>>>
>>>>> sched_cache_enabled() is still true and when detach_tasks() reaches
>>>>> one of these preferred llc tasks (which comes at the very end of the
>>>>> tasks list),
>>>>> we break out even if env->imbalance > 0 leaving
>>>>
>>>> Yes, but at least one task has been removed to even the load (making forward progress) and
>>>> the remaining tasks all wish to stay in the current LLC and will
>>>> preferred not to be moved. My thought was to not even all the load out
>>>> in one shot and pull more tasks out of their preferred LLC.
>>>> If the imbalance still remain, we'll come to that in the next load balance.
>>>
>>> In that case, can we spoof a LBF_ALL_PINNED for the case where we start
>>
>> In the code chunk (with fix I mentioned in last reply):
>>
>> +#ifdef CONFIG_SCHED_CACHE
>> +		/*
>> +		 * Don't detach more tasks if the remaining tasks want
>> +		 * to stay. We know the remaining tasks all prefer the
>> +		 * current LLC, because after order_tasks_by_llc(), the
>> +		 * tasks that prefer the current LLC are at the tail of
>> +		 * the list. The inhibition of detachment is to avoid too
>> +		 * many tasks being migrated out of the preferred LLC.
>> +		 */
>> +		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
>> +		    llc_id(env->src_cpu) == p->preferred_llc &&
>> 		    llc_id(env->dst_cpu) != p->preferred_llc)
>> +			break;
>>
>> We have already pulled at least one task when we stop detaching because we
>> know that all the remaining tasks want to stay in it current LLC.
>> "detached" is non zero when we break. So LBF_ALL_PINNED would be cleared.
>> We will only exit the detach_tasks loop when there are truly no tasks
>> that can be moved and it is truly a LBF_ALL_PINNED case.
> 
> So what I was suggesting is something like:
> 
> @@ -10251,6 +10252,7 @@ static int detach_tasks(struct lb_env *env)
>   	unsigned long util, load;
>   	struct task_struct *p;
>   	int detached = 0;
> +	bool preserve_preferred;
>   
>   	lockdep_assert_rq_held(env->src_rq);
>   
> @@ -10268,6 +10270,10 @@ static int detach_tasks(struct lb_env *env)
>   
>   	tasks = order_tasks_by_llc(env, &env->src_rq->cfs_tasks);
>   
> +	preserve_preferred = sched_cache_enabled() &&
> +			     !(env->sd->flags & SD_SHARE_LLC) &&

Maybe also check (env->sd->child->flags & SD_SHARE_LLC) because we only
care about the domain that is the parent of an LLC domain.

> +			     !env->sd->nr_balance_failed;
 > +
>   	while (!list_empty(tasks)) {
>   		/*
>   		 * We don't want to steal all, otherwise we may be treated likewise,
> @@ -10370,16 +10376,15 @@ static int detach_tasks(struct lb_env *env)
>   
>   #ifdef CONFIG_SCHED_CACHE
>   		/*
> -		 * Don't detach more tasks if the remaining tasks want
> -		 * to stay. We know the remaining tasks all prefer the
> -		 * current LLC, because after order_tasks_by_llc(), the
> -		 * tasks that prefer the current LLC are at the tail of
> -		 * the list. The inhibition of detachment is to avoid too
> -		 * many tasks being migrated out of the preferred LLC.
> +		 * We've hit tasks that prefer src LLC while balancing between LLCs.
> +		 * If previous balances have been successful, pretend the rest of the
> +		 * tasks on this CPU are pinned and let the main load balancing loop
> +		 * find another target CPU to pull from if imbalance exists.
>   		 */
> -		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
> -		    llc_id(env->src_cpu) == p->preferred_llc)
> +		if (preserve_preferred && detached && llc_id(env->src_cpu) == p->preferred_llc) {
> +			env->flags |= LBF_ALL_PINNED;

Let me try to understand this strategy: if all previous migrations
on this sched_domain have succeeded, it means that even if we stop
migrating tasks out of this busiest CPU from now on, it won’t
matter because the imbalance has already been mitigated. If we stop
the migration, we should look for other busy CPUs to pull some tasks
from. One concern is that setting LBF_ALL_PINNED only removes this busiest
CPU from the balance mask, yet it triggers a full re-scan of the entire
sched_domain, which might be costly, especially on large LLCs. We can try
this to see if it has any impact on the benchmark.
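
For reference, the all-pinned path in sched_balance_rq() looks roughly
like the below (paraphrased from memory, so treat it as a sketch rather
than the exact code), which is where the extra scan comes from:

	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		/* drop the busiest CPU and rescan the whole domain */
		__cpumask_clear_cpu(cpu_of(busiest), cpus);
		if (!cpumask_subset(cpus, env.dst_grpmask)) {
			env.loop = 0;
			env.loop_break = SCHED_NR_MIGRATE_BREAK;
			goto redo;	/* back to the group scan */
		}
		goto out_all_pinned;
	}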

thanks,
Chenyu

>   			break;
> +		}
>   #endif
>   
> 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-31 15:17                 ` Chen, Yu C
@ 2025-11-03 21:41                   ` Tim Chen
  0 siblings, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-11-03 21:41 UTC (permalink / raw)
  To: Chen, Yu C, K Prateek Nayak
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

On Fri, 2025-10-31 at 23:17 +0800, Chen, Yu C wrote:
> Hi Prateek,
> 
> On 10/31/2025 11:32 AM, K Prateek Nayak wrote:
> > Hello Tim,
> > 
> > On 10/31/2025 1:37 AM, Tim Chen wrote:
> > > On Thu, 2025-10-30 at 09:49 +0530, K Prateek Nayak wrote:
> > > > Hello Tim,
> > > > 
> > > > On 10/30/2025 2:39 AM, Tim Chen wrote:
> > > > > > > I suppose you are suggesting that the threshold for stopping task detachment
> > > > > > > should be higher. With the above can_migrate_llc() check, I suppose we have
> > > > > > > raised the threshold for stopping "task detachment"?
> > > > > > 
> > > > > > Say the LLC is under heavy load and we only have overloaded groups.
> > > > > > can_migrate_llc() would return "mig_unrestricted" since
> > > > > > fits_llc_capacity() would return false.
> > > > > > 
> > > > > > Since we are under "migrate_load", sched_balance_find_src_rq() has
> > > > > > returned the CPU with the highest load which could very well be the
> > > > > > CPU with with a large number of preferred LLC tasks.
> > > > > > 
> > > > > > sched_cache_enabled() is still true and when detach_tasks() reaches
> > > > > > one of these preferred llc tasks (which comes at the very end of the
> > > > > > tasks list),
> > > > > > we break out even if env->imbalance > 0 leaving
> > > > > 
> > > > > Yes, but at least one task has been removed to even the load (making forward progress) and
> > > > > the remaining tasks all wish to stay in the current LLC and would
> > > > > prefer not to be moved. My thought was not to even out all the load
> > > > > in one shot by pulling more tasks out of their preferred LLC.
> > > > > If the imbalance still remains, we'll come back to it in the next load balance.
> > > > 
> > > > In that case, can we spoof an LBF_ALL_PINNED for the case where we start
> > > 
> > > In the code chunk (with fix I mentioned in last reply):
> > > 
> > > +#ifdef CONFIG_SCHED_CACHE
> > > +		/*
> > > +		 * Don't detach more tasks if the remaining tasks want
> > > +		 * to stay. We know the remaining tasks all prefer the
> > > +		 * current LLC, because after order_tasks_by_llc(), the
> > > +		 * tasks that prefer the current LLC are at the tail of
> > > +		 * the list. The inhibition of detachment is to avoid too
> > > +		 * many tasks being migrated out of the preferred LLC.
> > > +		 */
> > > +		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
> > > +		    llc_id(env->src_cpu) == p->preferred_llc &&
> > > 		    llc_id(env->dst_cpu) != p->preferred_llc)
> > > +			break;
> > > 
> > > We have already pulled at least one task when we stop detaching because we
> > > know that all the remaining tasks want to stay in their current LLC.
> > > "detached" is non-zero when we break. So LBF_ALL_PINNED would be cleared.
> > > We will only exit the detach_tasks loop when there are truly no tasks
> > > that can be moved and it is truly an LBF_ALL_PINNED case.
> > 
> > So what I was suggesting is something like:
> > 
> > @@ -10251,6 +10252,7 @@ static int detach_tasks(struct lb_env *env)
> >   	unsigned long util, load;
> >   	struct task_struct *p;
> >   	int detached = 0;
> > +	bool preserve_preferred;
> >   
> >   	lockdep_assert_rq_held(env->src_rq);
> >   
> > @@ -10268,6 +10270,10 @@ static int detach_tasks(struct lb_env *env)
> >   
> >   	tasks = order_tasks_by_llc(env, &env->src_rq->cfs_tasks);
> >   
> > +	preserve_preferred = sched_cache_enabled() &&
> > +			     !(env->sd->flags & SD_SHARE_LLC) &&
> 
> Maybe also check (env->sd->child->flags & SD_SHARE_LLC) because we only
> care about the domain that is the parent of an LLC domain.
> 
> > +			     !env->sd->nr_balance_failed;
> > +
> >   	while (!list_empty(tasks)) {
> >   		/*
> >   		 * We don't want to steal all, otherwise we may be treated likewise,
> > @@ -10370,16 +10376,15 @@ static int detach_tasks(struct lb_env *env)
> >   
> >   #ifdef CONFIG_SCHED_CACHE
> >   		/*
> > -		 * Don't detach more tasks if the remaining tasks want
> > -		 * to stay. We know the remaining tasks all prefer the
> > -		 * current LLC, because after order_tasks_by_llc(), the
> > -		 * tasks that prefer the current LLC are at the tail of
> > -		 * the list. The inhibition of detachment is to avoid too
> > -		 * many tasks being migrated out of the preferred LLC.
> > +		 * We've hit tasks that prefer src LLC while balancing between LLCs.
> > +		 * If previous balances have been successful, pretend the rest of the
> > +		 * tasks on this CPU are pinned and let the main load balancing loop
> > +		 * find another source CPU to pull from if imbalance exists.
> >   		 */
> > -		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
> > -		    llc_id(env->src_cpu) == p->preferred_llc)
> > +		if (preserve_preferred && detached && llc_id(env->src_cpu) == p->preferred_llc) {
> > +			env->flags |= LBF_ALL_PINNED;
> 
> Let me try to understand this strategy: if all previous migrations
> on this sched_domain have succeeded, it means that even if we stop
> migrating tasks out of this busiest CPU from now on, it won’t
> matter because the imbalance has already been mitigated. If we stop
> the migration, we should look for other busy CPUs to pull some tasks
> from. One concern is that setting LBF_ALL_PINNED only removes this busiest
> CPU from the balance mask, yet it triggers a full re-scan of the entire
> sched_domain, which might be costly, especially on large LLCs. We can try
> this to see if it has any impact on the benchmark.

I think it does cause update_sd_lb_stats() to be called again with
the previous rq taken out.  So we are spending more CPU cycles
to find an alternative task to balance, in order to preserve the LLC preference.
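
Roughly, the path that gets re-run on the redo (paraphrasing the current
mainline structure; the exact shape may differ):

redo:
	/* update_sd_lb_stats() walks every group/CPU in the domain */
	group = sched_balance_find_src_group(&env);
	if (!group)
		goto out_balanced;

	busiest = sched_balance_find_src_rq(&env, group);
	if (!busiest)
		goto out_balanced;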

Tim

> 
> thanks,
> Chenyu
> 
> >   			break;
> > +		}
> >   #endif
> >   
> > 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach
  2025-10-31  3:32               ` K Prateek Nayak
  2025-10-31 15:17                 ` Chen, Yu C
@ 2025-11-03 22:07                 ` Tim Chen
  1 sibling, 0 replies; 116+ messages in thread
From: Tim Chen @ 2025-11-03 22:07 UTC (permalink / raw)
  To: K Prateek Nayak, Chen, Yu C
  Cc: Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Tim Chen, linux-kernel, Peter Zijlstra,
	Gautham R . Shenoy, Ingo Molnar

On Fri, 2025-10-31 at 09:02 +0530, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 10/31/2025 1:37 AM, Tim Chen wrote:
> > On Thu, 2025-10-30 at 09:49 +0530, K Prateek Nayak wrote:
> > > Hello Tim,
> > > 
> > > On 10/30/2025 2:39 AM, Tim Chen wrote:
> > > > > > I suppose you are suggesting that the threshold for stopping task detachment
> > > > > > should be higher. With the above can_migrate_llc() check, I suppose we have
> > > > > > raised the threshold for stopping "task detachment"?
> > > > > 
> > > > > Say the LLC is under heavy load and we only have overloaded groups.
> > > > > can_migrate_llc() would return "mig_unrestricted" since
> > > > > fits_llc_capacity() would return false.
> > > > > 
> > > > > Since we are under "migrate_load", sched_balance_find_src_rq() has
> > > > > returned the CPU with the highest load which could very well be the
> > > > > CPU with with a large number of preferred LLC tasks.
> > > > > 
> > > > > sched_cache_enabled() is still true and when detach_tasks() reaches
> > > > > one of these preferred llc tasks (which comes at the very end of the
> > > > > tasks list), 
> > > > > we break out even if env->imbalance > 0 leaving
> > > > 
> > > > Yes, but at least one task has been removed to even the load (making forward progress) and
> > > > the remaining tasks all wish to stay in the current LLC and would
> > > > prefer not to be moved. My thought was not to even out all the load
> > > > in one shot by pulling more tasks out of their preferred LLC.
> > > > If the imbalance still remains, we'll come back to it in the next load balance.
> > > 
> > > In that case, can we spoof an LBF_ALL_PINNED for the case where we start
> > 
> > In the code chunk (with fix I mentioned in last reply):
> > 
> > +#ifdef CONFIG_SCHED_CACHE
> > +		/*
> > +		 * Don't detach more tasks if the remaining tasks want
> > +		 * to stay. We know the remaining tasks all prefer the
> > +		 * current LLC, because after order_tasks_by_llc(), the
> > +		 * tasks that prefer the current LLC are at the tail of
> > +		 * the list. The inhibition of detachment is to avoid too
> > +		 * many tasks being migrated out of the preferred LLC.
> > +		 */
> > +		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
> > +		    llc_id(env->src_cpu) == p->preferred_llc &&
> > 		    llc_id(env->dst_cpu) != p->preferred_llc)
> > +			break;
> > 
> > We have already pulled at least one task when we stop detaching because we
> > know that all the remaining tasks want to stay in their current LLC.
> > "detached" is non-zero when we break. So LBF_ALL_PINNED would be cleared.
> > We will only exit the detach_tasks loop when there are truly no tasks
> > that can be moved and it is truly an LBF_ALL_PINNED case.
> 
> So what I was suggesting is something like:
> 
> @@ -10251,6 +10252,7 @@ static int detach_tasks(struct lb_env *env)
>  	unsigned long util, load;
>  	struct task_struct *p;
>  	int detached = 0;
> +	bool preserve_preferred;
>  
>  	lockdep_assert_rq_held(env->src_rq);
>  
> @@ -10268,6 +10270,10 @@ static int detach_tasks(struct lb_env *env)
>  
>  	tasks = order_tasks_by_llc(env, &env->src_rq->cfs_tasks);
>  
> +	preserve_preferred = sched_cache_enabled() &&
> +			     !(env->sd->flags & SD_SHARE_LLC) &&
> +			     !env->sd->nr_balance_failed;
> +
>  	while (!list_empty(tasks)) {
>  		/*
>  		 * We don't want to steal all, otherwise we may be treated likewise,
> @@ -10370,16 +10376,15 @@ static int detach_tasks(struct lb_env *env)
>  
>  #ifdef CONFIG_SCHED_CACHE
>  		/*
> -		 * Don't detach more tasks if the remaining tasks want
> -		 * to stay. We know the remaining tasks all prefer the
> -		 * current LLC, because after order_tasks_by_llc(), the
> -		 * tasks that prefer the current LLC are at the tail of
> -		 * the list. The inhibition of detachment is to avoid too
> -		 * many tasks being migrated out of the preferred LLC.
> +		 * We've hit tasks that prefer src LLC while balancing between LLCs.
> +		 * If previous balances have been successful, pretend the rest of the
> +		 * tasks on this CPU are pinned and let the main load balancing loop
> +		 * find another source CPU to pull from if imbalance exists.
>  		 */
> -		if (sched_cache_enabled() && detached && p->preferred_llc != -1 &&
> -		    llc_id(env->src_cpu) == p->preferred_llc)
> +		if (preserve_preferred && detached && llc_id(env->src_cpu) == p->preferred_llc) {
> +			env->flags |= LBF_ALL_PINNED;
>  			break;
> +		}
>  #endif
>  

You have a good point.  If all the remaining tasks on the rq prefer src_llc,
we should find an alternative rq to pull tasks from by specifying LBF_ALL_PINNED.

This policy makes sense for the migrate_task and migrate_llc_task cases.
I have to think about the migrate_util case, where the source group is overloaded
and dst group has spare capacity, and tasks in the source group are in their
preferred LLC.
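
Just to make that distinction concrete, one possible way to scope it
(a rough, untested sketch on top of your diff, not a statement that
this is the right policy for migrate_util):

	if (preserve_preferred && detached &&
	    llc_id(env->src_cpu) == p->preferred_llc) {
		/*
		 * Only spoof "all pinned" for the task-count based
		 * migration types; for migrate_util, where the source
		 * group is overloaded and the dst group has spare
		 * capacity, keep just stopping here for now.
		 */
		if (env->migration_type == migrate_task ||
		    env->migration_type == migrate_llc_task)
			env->flags |= LBF_ALL_PINNED;
		break;
	}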

Tim

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2025-11-03 22:07 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-11 18:24 [PATCH 00/19] Cache Aware Scheduling Tim Chen
2025-10-11 18:24 ` [PATCH 01/19] sched/fair: Add infrastructure for cache-aware load balancing Tim Chen
2025-10-14 19:12   ` Madadi Vineeth Reddy
2025-10-15  4:54     ` Chen, Yu C
2025-10-15 19:32       ` Tim Chen
2025-10-16  3:11         ` Chen, Yu C
2025-10-15 11:54     ` Peter Zijlstra
2025-10-15 16:07       ` Chen, Yu C
2025-10-23  7:26   ` kernel test robot
2025-10-27  4:47   ` K Prateek Nayak
2025-10-27 13:35     ` Chen, Yu C
2025-10-11 18:24 ` [PATCH 02/19] sched/fair: Record per-LLC utilization to guide cache-aware scheduling decisions Tim Chen
2025-10-15 10:15   ` Peter Zijlstra
2025-10-15 16:27     ` Chen, Yu C
2025-10-27  5:01   ` K Prateek Nayak
2025-10-27 14:07     ` Chen, Yu C
2025-10-28  2:50       ` K Prateek Nayak
2025-10-11 18:24 ` [PATCH 03/19] sched/fair: Introduce helper functions to enforce LLC migration policy Tim Chen
2025-10-11 18:24 ` [PATCH 04/19] sched/fair: Introduce a static key to enable cache aware only for multi LLCs Tim Chen
2025-10-15 11:04   ` Peter Zijlstra
2025-10-15 16:25     ` Chen, Yu C
2025-10-15 16:36       ` Shrikanth Hegde
2025-10-15 17:01         ` Chen, Yu C
2025-10-16  7:42           ` Peter Zijlstra
2025-10-17  2:08             ` Chen, Yu C
2025-10-16  7:40       ` Peter Zijlstra
2025-10-27  5:42   ` K Prateek Nayak
2025-10-27 12:56     ` Chen, Yu C
2025-10-27 23:36       ` Tim Chen
2025-10-29 12:36         ` Chen, Yu C
2025-10-28  2:46       ` K Prateek Nayak
2025-10-11 18:24 ` [PATCH 05/19] sched/fair: Add LLC index mapping for CPUs Tim Chen
2025-10-15 11:08   ` Peter Zijlstra
2025-10-15 11:58   ` Peter Zijlstra
2025-10-15 20:12     ` Tim Chen
2025-10-11 18:24 ` [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes Tim Chen
2025-10-14  5:16   ` Chen, Yu C
2025-10-15 11:15     ` Peter Zijlstra
2025-10-16  3:13       ` Chen, Yu C
2025-10-17  4:50       ` Chen, Yu C
2025-10-20  9:41         ` Vern Hao
2025-10-11 18:24 ` [PATCH 07/19] sched/fair: Track LLC-preferred tasks per runqueue Tim Chen
2025-10-15 12:05   ` Peter Zijlstra
2025-10-15 20:03     ` Tim Chen
2025-10-16  7:44       ` Peter Zijlstra
2025-10-16 20:06         ` Tim Chen
2025-10-27  6:04   ` K Prateek Nayak
2025-10-28 15:15     ` Chen, Yu C
2025-10-28 15:46       ` Tim Chen
2025-10-29  4:32         ` K Prateek Nayak
2025-10-29 12:48           ` Chen, Yu C
2025-10-29  4:00       ` K Prateek Nayak
2025-10-28 17:06     ` Tim Chen
2025-10-11 18:24 ` [PATCH 08/19] sched/fair: Introduce per runqueue task LLC preference counter Tim Chen
2025-10-15 12:21   ` Peter Zijlstra
2025-10-15 20:41     ` Tim Chen
2025-10-16  7:49       ` Peter Zijlstra
2025-10-21  8:28       ` Madadi Vineeth Reddy
2025-10-23  6:07         ` Chen, Yu C
2025-10-11 18:24 ` [PATCH 09/19] sched/fair: Count tasks prefering each LLC in a sched group Tim Chen
2025-10-15 12:22   ` Peter Zijlstra
2025-10-15 20:42     ` Tim Chen
2025-10-15 12:25   ` Peter Zijlstra
2025-10-15 20:43     ` Tim Chen
2025-10-27  8:33   ` K Prateek Nayak
2025-10-27 23:19     ` Tim Chen
2025-10-11 18:24 ` [PATCH 10/19] sched/fair: Prioritize tasks preferring destination LLC during balancing Tim Chen
2025-10-15  7:23   ` kernel test robot
2025-10-15 15:08   ` Peter Zijlstra
2025-10-15 21:28     ` Tim Chen
2025-10-15 15:10   ` Peter Zijlstra
2025-10-15 16:03     ` Chen, Yu C
2025-10-24  9:32   ` Aaron Lu
2025-10-27  2:00     ` Chen, Yu C
2025-10-29  9:51       ` Aaron Lu
2025-10-29 13:19         ` Chen, Yu C
2025-10-27  6:29   ` K Prateek Nayak
2025-10-28 12:11     ` Chen, Yu C
2025-10-11 18:24 ` [PATCH 11/19] sched/fair: Identify busiest sched_group for LLC-aware load balancing Tim Chen
2025-10-15 15:24   ` Peter Zijlstra
2025-10-15 21:18     ` Tim Chen
2025-10-11 18:24 ` [PATCH 12/19] sched/fair: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
2025-10-27  9:04   ` K Prateek Nayak
2025-10-27 22:59     ` Tim Chen
2025-10-11 18:24 ` [PATCH 13/19] sched/fair: Handle moving single tasks to/from their preferred LLC Tim Chen
2025-10-11 18:24 ` [PATCH 14/19] sched/fair: Consider LLC preference when selecting tasks for load balancing Tim Chen
2025-10-11 18:24 ` [PATCH 15/19] sched/fair: Respect LLC preference in task migration and detach Tim Chen
2025-10-28  6:02   ` K Prateek Nayak
2025-10-28 11:58     ` Chen, Yu C
2025-10-28 15:30       ` Tim Chen
2025-10-29  4:15         ` K Prateek Nayak
2025-10-29  3:54       ` K Prateek Nayak
2025-10-29 14:23         ` Chen, Yu C
2025-10-29 21:09         ` Tim Chen
2025-10-30  4:19           ` K Prateek Nayak
2025-10-30 20:07             ` Tim Chen
2025-10-31  3:32               ` K Prateek Nayak
2025-10-31 15:17                 ` Chen, Yu C
2025-11-03 21:41                   ` Tim Chen
2025-11-03 22:07                 ` Tim Chen
2025-10-11 18:24 ` [PATCH 16/19] sched/fair: Exclude processes with many threads from cache-aware scheduling Tim Chen
2025-10-23  7:22   ` kernel test robot
2025-10-11 18:24 ` [PATCH 17/19] sched/fair: Disable cache aware scheduling for processes with high thread counts Tim Chen
2025-10-22 17:21   ` Madadi Vineeth Reddy
2025-10-23  6:55     ` Chen, Yu C
2025-10-11 18:24 ` [PATCH 18/19] sched/fair: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
2025-10-15  6:57   ` kernel test robot
2025-10-16  4:44     ` Chen, Yu C
2025-10-11 18:24 ` [PATCH 19/19] sched/fair: Add user control to adjust the tolerance of cache-aware scheduling Tim Chen
2025-10-29  8:07   ` Aaron Lu
2025-10-29 12:54     ` Chen, Yu C
2025-10-14 12:13 ` [PATCH 00/19] Cache Aware Scheduling Madadi Vineeth Reddy
2025-10-14 21:48   ` Tim Chen
2025-10-15  5:38     ` Chen, Yu C
2025-10-15 18:26       ` Madadi Vineeth Reddy
2025-10-16  4:57         ` Chen, Yu C

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).