[RFC patch v3 00/20] Cache aware scheduling

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC patch v3 00/20] Cache aware scheduling
@ 2025-06-18 18:27 Tim Chen
  2025-06-18 18:27 ` [RFC patch v3 01/20] sched: Cache aware load-balancing Tim Chen
                   ` (23 more replies)
  0 siblings, 24 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

This is the third revision of the cache aware scheduling patches,
based on the original patch proposed by Peter[1].

The goal of the patch series is to aggregate tasks sharing data
to the same cache domain, thereby reducing cache bouncing and
cache misses, and improve data access efficiency. In the current
implementation, threads within the same process are considered
as entities that potentially share resources.

In previous versions, aggregation of tasks were done in the
wake up path, without making load balancing paths aware of
LLC (Last-Level-Cache) preference. This led to the following
problems:

1) Aggregation of tasks during wake up led to load imbalance
   between LLCs
2) Load balancing tried to even out the load between LLCs
3) Wake up tasks aggregation happened at a faster rate and
   load balancing moved tasks in opposite directions, leading
   to continuous and excessive task migrations and regressions
   in benchmarks like schbench.

In this version, load balancing is made cache-aware. The main
idea of cache-aware load balancing consists of two parts:

1) Identify tasks that prefer to run on their hottest LLC and
   move them there.
2) Prevent generic load balancing from moving a task out of
   its hottest LLC.

By default, LLC task aggregation during wake-up is disabled.
Conversely, cache-aware load balancing is enabled by default.
For easier comparison, two scheduler features are introduced:
SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
wake up and cache-aware load balancing, respectively. By default,
NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
is only done on load balancing.

With above default settings, task migrations occur less frequently
and no longer happen in the latency-sensitive wake-up path.

The load balancing and migration policy are now implemented in
a single location within the function _get_migrate_hint().
Debugfs knobs are also introduced to fine-tune the
_get_migrate_hint() function. Please refer to patch 7 for
detail.

Improvements in performance for hackbench are observed in the
lower load ranges when tested on a 2 socket sapphire rapids with
30 cores per socket. The DRAM interleaving is enabled in the
BIOS so it essential has one NUMA node with two last level
caches. Hackbench benefits from having all the threads
in the process running in the same LLC. There are some small
regressions for the heavily loaded case when not all threads can
fit in a LLC.

Hackbench is run with one process, and pairs of threads ping
ponging message off each other via command with increasing number
of thread pairs, each test runs for 10 cycles:

hackbench -g 1 --thread --pipe(socket) -l 1000000 -s 100 -f <pairs>

case                    load            baseline(std%)  compare%( std%)
threads-pipe-8          1-groups         1.00 (  2.70)  +24.51 (  0.59)
threads-pipe-15         1-groups         1.00 (  1.42)  +28.37 (  0.68)
threads-pipe-30         1-groups         1.00 (  2.53)  +26.16 (  0.11)
threads-pipe-45         1-groups         1.00 (  0.48)  +35.38 (  0.18)
threads-pipe-60         1-groups         1.00 (  2.13)  +13.46 ( 12.81)
threads-pipe-75         1-groups         1.00 (  1.57)  +16.71 (  0.20)
threads-pipe-90         1-groups         1.00 (  0.22)   -0.57 (  1.21)
threads-sockets-8       1-groups         1.00 (  2.82)  +23.04 (  0.83)
threads-sockets-15      1-groups         1.00 (  2.57)  +21.67 (  1.90)
threads-sockets-30      1-groups         1.00 (  0.75)  +18.78 (  0.09)
threads-sockets-45      1-groups         1.00 (  1.63)  +18.89 (  0.43)
threads-sockets-60      1-groups         1.00 (  0.66)  +10.10 (  1.91)
threads-sockets-75      1-groups         1.00 (  0.44)  -14.49 (  0.43)
threads-sockets-90      1-groups         1.00 (  0.15)   -8.03 (  3.88)

Similar tests were also experimented on schbench on the system.
Overall latency improvement is observed when underloaded and
regression when overloaded. The regression is significantly
smaller than the previous version because cache aware aggregation
is in load balancing rather than in wake up path. Besides, it is
found that the schbench seems to have large run-to-run variance,
so the result of schbench might be only used as reference.

schbench:
                                   baseline              nowake_lb
Lat 50.0th-qrtle-1          5.00 (   0.00%)        5.00 (   0.00%)
Lat 90.0th-qrtle-1          9.00 (   0.00%)        8.00 (  11.11%)
Lat 99.0th-qrtle-1         15.00 (   0.00%)       15.00 (   0.00%)
Lat 99.9th-qrtle-1         32.00 (   0.00%)       23.00 (  28.12%)
Lat 20.0th-qrtle-1        267.00 (   0.00%)      266.00 (   0.37%)
Lat 50.0th-qrtle-2          8.00 (   0.00%)        4.00 (  50.00%)
Lat 90.0th-qrtle-2          9.00 (   0.00%)        7.00 (  22.22%)
Lat 99.0th-qrtle-2         18.00 (   0.00%)       11.00 (  38.89%)
Lat 99.9th-qrtle-2         26.00 (   0.00%)       25.00 (   3.85%)
Lat 20.0th-qrtle-2        535.00 (   0.00%)      537.00 (  -0.37%)
Lat 50.0th-qrtle-4          6.00 (   0.00%)        4.00 (  33.33%)
Lat 90.0th-qrtle-4          8.00 (   0.00%)        5.00 (  37.50%)
Lat 99.0th-qrtle-4         13.00 (   0.00%)       10.00 (  23.08%)
Lat 99.9th-qrtle-4         20.00 (   0.00%)       14.00 (  30.00%)
Lat 20.0th-qrtle-4       1066.00 (   0.00%)     1050.00 (   1.50%)
Lat 50.0th-qrtle-8          5.00 (   0.00%)        4.00 (  20.00%)
Lat 90.0th-qrtle-8          7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-8         11.00 (   0.00%)        8.00 (  27.27%)
Lat 99.9th-qrtle-8         17.00 (   0.00%)       18.00 (  -5.88%)
Lat 20.0th-qrtle-8       2140.00 (   0.00%)     2156.00 (  -0.75%)
Lat 50.0th-qrtle-16         6.00 (   0.00%)        4.00 (  33.33%)
Lat 90.0th-qrtle-16         7.00 (   0.00%)        6.00 (  14.29%)
Lat 99.0th-qrtle-16        11.00 (   0.00%)       11.00 (   0.00%)
Lat 99.9th-qrtle-16        18.00 (   0.00%)       18.00 (   0.00%)
Lat 20.0th-qrtle-16      4296.00 (   0.00%)     4216.00 (   1.86%)
Lat 50.0th-qrtle-32         6.00 (   0.00%)        4.00 (  33.33%)
Lat 90.0th-qrtle-32         7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-32        11.00 (   0.00%)        9.00 (  18.18%)
Lat 99.9th-qrtle-32        17.00 (   0.00%)       14.00 (  17.65%)
Lat 20.0th-qrtle-32      8496.00 (   0.00%)     8624.00 (  -1.51%)
Lat 50.0th-qrtle-64         5.00 (   0.00%)        5.00 (   0.00%)
Lat 90.0th-qrtle-64         7.00 (   0.00%)        7.00 (   0.00%)
Lat 99.0th-qrtle-64        11.00 (   0.00%)       11.00 (   0.00%)
Lat 99.9th-qrtle-64        17.00 (   0.00%)       18.00 (  -5.88%)
Lat 20.0th-qrtle-64     17120.00 (   0.00%)    15728.00 (   8.13%)
Lat 50.0th-qrtle-128        6.00 (   0.00%)        6.00 (   0.00%)
Lat 90.0th-qrtle-128        9.00 (   0.00%)        8.00 (  11.11%)
Lat 99.0th-qrtle-128       13.00 (   0.00%)       14.00 (  -7.69%)
Lat 99.9th-qrtle-128       20.00 (   0.00%)       26.00 ( -30.00%)
Lat 20.0th-qrtle-128    19488.00 (   0.00%)    18784.00 (   3.61%)
Lat 50.0th-qrtle-239        8.00 (   0.00%)        8.00 (   0.00%)
Lat 90.0th-qrtle-239       16.00 (   0.00%)       14.00 (  12.50%)
Lat 99.0th-qrtle-239       45.00 (   0.00%)       41.00 (   8.89%)
Lat 99.9th-qrtle-239      137.00 (   0.00%)      225.00 ( -64.23%)
Lat 20.0th-qrtle-239    30432.00 (   0.00%)    29920.00 (   1.68%)

AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
with 1 group test scenario benefits from cache aware load balance
too:

hackbench(1 group and fd ranges in [1,6]:
case                    load            baseline(std%)  compare%( std%)
threads-pipe-1          1-groups         1.00 (  1.22)   +2.84 (  0.51)
threads-pipe-2          1-groups         1.00 (  5.82)  +42.82 ( 43.61)
threads-pipe-3          1-groups         1.00 (  3.49)  +17.33 ( 18.68)
threads-pipe-4          1-groups         1.00 (  2.49)  +12.49 (  5.89)
threads-pipe-5          1-groups         1.00 (  1.46)   +8.62 (  4.43)
threads-pipe-6          1-groups         1.00 (  2.83)  +12.73 (  8.94)
threads-sockets-1       1-groups         1.00 (  1.31)  +28.68 (  2.25)
threads-sockets-2       1-groups         1.00 (  5.17)  +34.84 ( 36.90)
threads-sockets-3       1-groups         1.00 (  1.57)   +9.15 (  5.52)
threads-sockets-4       1-groups         1.00 (  1.99)  +16.51 (  6.04)
threads-sockets-5       1-groups         1.00 (  2.39)  +10.88 (  2.17)
threads-sockets-6       1-groups         1.00 (  1.62)   +7.22 (  2.00)

Besides a single instance of hackbench, four instances of hackbench are
also tested on Milan. The test results show that different instances of
hackbench are aggregated to dedicated LLCs, and performance improvement
is observed.

schbench mmtests(unstable)
                                  baseline              nowake_lb
Lat 50.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
Lat 90.0th-qrtle-1        12.00 (   0.00%)       10.00 (  16.67%)
Lat 99.0th-qrtle-1        16.00 (   0.00%)       14.00 (  12.50%)
Lat 99.9th-qrtle-1        22.00 (   0.00%)       21.00 (   4.55%)
Lat 20.0th-qrtle-1       759.00 (   0.00%)      759.00 (   0.00%)
Lat 50.0th-qrtle-2         9.00 (   0.00%)        7.00 (  22.22%)
Lat 90.0th-qrtle-2        12.00 (   0.00%)       12.00 (   0.00%)
Lat 99.0th-qrtle-2        16.00 (   0.00%)       15.00 (   6.25%)
Lat 99.9th-qrtle-2        22.00 (   0.00%)       21.00 (   4.55%)
Lat 20.0th-qrtle-2      1534.00 (   0.00%)     1510.00 (   1.56%)
Lat 50.0th-qrtle-4         8.00 (   0.00%)        9.00 ( -12.50%)
Lat 90.0th-qrtle-4        12.00 (   0.00%)       12.00 (   0.00%)
Lat 99.0th-qrtle-4        15.00 (   0.00%)       16.00 (  -6.67%)
Lat 99.9th-qrtle-4        21.00 (   0.00%)       23.00 (  -9.52%)
Lat 20.0th-qrtle-4      3076.00 (   0.00%)     2860.00 (   7.02%)
Lat 50.0th-qrtle-8        10.00 (   0.00%)        9.00 (  10.00%)
Lat 90.0th-qrtle-8        12.00 (   0.00%)       13.00 (  -8.33%)
Lat 99.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
Lat 99.9th-qrtle-8        22.00 (   0.00%)       24.00 (  -9.09%)
Lat 20.0th-qrtle-8      6232.00 (   0.00%)     5896.00 (   5.39%)
Lat 50.0th-qrtle-16        9.00 (   0.00%)        9.00 (   0.00%)
Lat 90.0th-qrtle-16       13.00 (   0.00%)       13.00 (   0.00%)
Lat 99.0th-qrtle-16       17.00 (   0.00%)       18.00 (  -5.88%)
Lat 99.9th-qrtle-16       23.00 (   0.00%)       26.00 ( -13.04%)
Lat 20.0th-qrtle-16    10096.00 (   0.00%)    10352.00 (  -2.54%)
Lat 50.0th-qrtle-32       15.00 (   0.00%)       15.00 (   0.00%)
Lat 90.0th-qrtle-32       25.00 (   0.00%)       26.00 (  -4.00%)
Lat 99.0th-qrtle-32       49.00 (   0.00%)       50.00 (  -2.04%)
Lat 99.9th-qrtle-32      945.00 (   0.00%)     1005.00 (  -6.35%)
Lat 20.0th-qrtle-32    11600.00 (   0.00%)    11632.00 (  -0.28%)

Netperf/Tbench have not been tested yet. As they are single-process
benchmarks that are not the target of this cache-aware scheduling.
Additionally, client and server components should be tested on
different machines or bound to different nodes. Otherwise,
cache-aware scheduling might harm their performance: placing client
and server in the same LLC could yield higher throughput due to
improved cache locality in the TCP/IP stack, whereas cache-aware
scheduling aims to place them in dedicated LLCs.

This patch set is applied on v6.15 kernel.

There are some further work needed for future versions in this
patch set.  We will need to align NUMA balancing with LLC aggregations
such that LLC aggregation will align with the preferred NUMA node.

Comments and tests are much appreciated.

[1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/

The patches are grouped as follow:
Patch 1:     Peter's original patch.
Patch 2-5:   Various fixes and tuning of the original v1 patch.
Patch 6-12:  Infrastructure and helper functions for load balancing to be cache aware.
Patch 13-18: Add logic to load balancing for preferred LLC aggregation.
Patch 19:    Add process LLC aggregation in load balancing sched feature.
Patch 20:    Add Process LLC aggregation in wake up sched feature (turn off by default).

v1:
https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
v2:
https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/

Chen Yu (3):
  sched: Several fixes for cache aware scheduling
  sched: Avoid task migration within its preferred LLC
  sched: Save the per LLC utilization for better cache aware scheduling

K Prateek Nayak (1):
  sched: Avoid calculating the cpumask if the system is overloaded

Peter Zijlstra (1):
  sched: Cache aware load-balancing

Tim Chen (15):
  sched: Add hysteresis to switch a task's preferred LLC
  sched: Add helper function to decide whether to allow cache aware
    scheduling
  sched: Set up LLC indexing
  sched: Introduce task preferred LLC field
  sched: Calculate the number of tasks that have LLC preference on a
    runqueue
  sched: Introduce per runqueue task LLC preference counter
  sched: Calculate the total number of preferred LLC tasks during load
    balance
  sched: Tag the sched group as llc_balance if it has tasks prefer other
    LLC
  sched: Introduce update_llc_busiest() to deal with groups having
    preferred LLC tasks
  sched: Introduce a new migration_type to track the preferred LLC load
    balance
  sched: Consider LLC locality for active balance
  sched: Consider LLC preference when picking tasks from busiest queue
  sched: Do not migrate task if it is moving out of its preferred LLC
  sched: Introduce SCHED_CACHE_LB to control cache aware load balance
  sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
    up

 include/linux/mm_types.h       |  44 ++
 include/linux/sched.h          |   8 +
 include/linux/sched/topology.h |   3 +
 init/Kconfig                   |   4 +
 init/init_task.c               |   3 +
 kernel/fork.c                  |   5 +
 kernel/sched/core.c            |  25 +-
 kernel/sched/debug.c           |   4 +
 kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
 kernel/sched/features.h        |   3 +
 kernel/sched/sched.h           |  23 +
 kernel/sched/topology.c        |  29 ++
 12 files changed, 982 insertions(+), 28 deletions(-)

-- 
2.32.0

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-06-26 12:23   ` Jianyong Wu
  2025-07-03 19:29   ` Shrikanth Hegde
  2025-06-18 18:27 ` [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling Tim Chen
                   ` (22 subsequent siblings)
  23 siblings, 2 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu

From: Peter Zijlstra <peterz@infradead.org>

Hi all,

One of the many things on the eternal todo list has been finishing the
below hackery.

It is an attempt at modelling cache affinity -- and while the patch
really only targets LLC, it could very well be extended to also apply to
clusters (L2). Specifically any case of multiple cache domains inside a
node.

Anyway, I wrote this about a year ago, and I mentioned this at the
recent OSPM conf where Gautham and Prateek expressed interest in playing
with this code.

So here goes, very rough and largely unproven code ahead :-)

It applies to current tip/master, but I know it will fail the __percpu
validation that sits in -next, although that shouldn't be terribly hard
to fix up.

As is, it only computes a CPU inside the LLC that has the highest recent
runtime, this CPU is then used in the wake-up path to steer towards this
LLC and in task_hot() to limit migrations away from it.

More elaborate things could be done, notably there is an XXX in there
somewhere about finding the best LLC inside a NODE (interaction with
NUMA_BALANCING).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm_types.h |  44 ++++++
 include/linux/sched.h    |   4 +
 init/Kconfig             |   4 +
 kernel/fork.c            |   5 +
 kernel/sched/core.c      |  13 +-
 kernel/sched/fair.c      | 330 +++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h     |   8 +
 7 files changed, 388 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..013291c6aaa2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -893,6 +893,12 @@ struct mm_cid {
 };
 #endif
 
+struct mm_sched {
+	u64 runtime;
+	unsigned long epoch;
+	unsigned long occ;
+};
+
 struct kioctx_table;
 struct iommu_mm_data;
 struct mm_struct {
@@ -983,6 +989,17 @@ struct mm_struct {
 		 */
 		raw_spinlock_t cpus_allowed_lock;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		/*
+		 * Track per-cpu-per-process occupancy as a proxy for cache residency.
+		 * See account_mm_sched() and ...
+		 */
+		struct mm_sched __percpu *pcpu_sched;
+		raw_spinlock_t mm_sched_lock;
+		unsigned long mm_sched_epoch;
+		int mm_sched_cpu;
+#endif
+
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
 #endif
@@ -1393,6 +1410,33 @@ static inline unsigned int mm_cid_size(void)
 static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
 #endif /* CONFIG_SCHED_MM_CID */
 
+#ifdef CONFIG_SCHED_CACHE
+extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sched);
+
+static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
+{
+	struct mm_sched *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+	if (!pcpu_sched)
+		return -ENOMEM;
+
+	mm_init_sched(mm, pcpu_sched);
+	return 0;
+}
+
+#define mm_alloc_sched(...)	alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
+
+static inline void mm_destroy_sched(struct mm_struct *mm)
+{
+	free_percpu(mm->pcpu_sched);
+	mm->pcpu_sched = NULL;
+}
+#else /* !CONFIG_SCHED_CACHE */
+
+static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_sched(struct mm_struct *mm) { }
+
+#endif /* CONFIG_SCHED_CACHE */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..d0e4cda2b3cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1399,6 +1399,10 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_CACHE
+	struct callback_head		cache_work;
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_len;
diff --git a/init/Kconfig b/init/Kconfig
index bf3a920064be..e2509127b6f9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -953,6 +953,10 @@ config NUMA_BALANCING
 
 	  This system will be inactive on UMA systems.
 
+config SCHED_CACHE
+	bool "Cache aware scheduler"
+	default y
+
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
diff --git a/kernel/fork.c b/kernel/fork.c
index 168681fc4b25..da1387823b9e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1332,6 +1332,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
+	if (mm_alloc_sched(mm))
+		goto fail_sched;
+
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
 				     NR_MM_COUNTERS))
 		goto fail_pcpu;
@@ -1341,6 +1344,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	return mm;
 
 fail_pcpu:
+	mm_destroy_sched(mm);
+fail_sched:
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c81cf642dba0..d9c3e75f79d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4524,6 +4524,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	init_sched_mm(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -8526,6 +8527,7 @@ static struct kmem_cache *task_group_cache __ro_after_init;
 
 void __init sched_init(void)
 {
+	unsigned long now = jiffies;
 	unsigned long ptr = 0;
 	int i;
 
@@ -8600,7 +8602,7 @@ void __init sched_init(void)
 		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
-		rq->calc_load_update = jiffies + LOAD_FREQ;
+		rq->calc_load_update = now + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs);
 		init_rt_rq(&rq->rt);
 		init_dl_rq(&rq->dl);
@@ -8644,7 +8646,7 @@ void __init sched_init(void)
 		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
 		rq->balance_callback = &balance_push_callback;
 		rq->active_balance = 0;
-		rq->next_balance = jiffies;
+		rq->next_balance = now;
 		rq->push_cpu = 0;
 		rq->cpu = i;
 		rq->online = 0;
@@ -8656,7 +8658,7 @@ void __init sched_init(void)
 
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
-		rq->last_blocked_load_update_tick = jiffies;
+		rq->last_blocked_load_update_tick = now;
 		atomic_set(&rq->nohz_flags, 0);
 
 		INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
@@ -8681,6 +8683,11 @@ void __init sched_init(void)
 
 		rq->core_cookie = 0UL;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		raw_spin_lock_init(&rq->cpu_epoch_lock);
+		rq->cpu_epoch_next = now;
+#endif
+
 		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..df7d4a324fbe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1166,10 +1166,229 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 	return delta_exec;
 }
 
-static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
+#ifdef CONFIG_SCHED_CACHE
+
+/*
+ * XXX numbers come from a place the sun don't shine -- probably wants to be SD
+ * tunable or so.
+ */
+#define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
+#define EPOCH_OLD	5		/* 50 ms */
+
+void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
+{
+	unsigned long epoch;
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct rq *rq = cpu_rq(i);
+
+		pcpu_sched->runtime = 0;
+		pcpu_sched->epoch = epoch = rq->cpu_epoch;
+		pcpu_sched->occ = -1;
+	}
+
+	raw_spin_lock_init(&mm->mm_sched_lock);
+	mm->mm_sched_epoch = epoch;
+	mm->mm_sched_cpu = -1;
+
+	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
+}
+
+/* because why would C be fully specified */
+static __always_inline void __shr_u64(u64 *val, unsigned int n)
+{
+	if (n >= 64) {
+		*val = 0;
+		return;
+	}
+	*val >>= n;
+}
+
+static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	lockdep_assert_held(&rq->cpu_epoch_lock);
+
+	unsigned long n, now = jiffies;
+	long delta = now - rq->cpu_epoch_next;
+
+	if (delta > 0) {
+		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		rq->cpu_epoch += n;
+		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		__shr_u64(&rq->cpu_runtime, n);
+	}
+
+	n = rq->cpu_epoch - pcpu_sched->epoch;
+	if (n) {
+		pcpu_sched->epoch += n;
+		__shr_u64(&pcpu_sched->runtime, n);
+	}
+}
+
+static unsigned long fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+
+	__update_mm_sched(rq, pcpu_sched);
+
+	/*
+	 * Runtime is a geometric series (r=0.5) and as such will sum to twice
+	 * the accumulation period, this means the multiplcation here should
+	 * not overflow.
+	 */
+	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
+}
+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+	struct mm_struct *mm = p->mm;
+	struct mm_sched *pcpu_sched;
+	unsigned long epoch;
+
+	/*
+	 * init_task and kthreads don't be having no mm
+	 */
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
+
+	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+		__update_mm_sched(rq, pcpu_sched);
+		pcpu_sched->runtime += delta_exec;
+		rq->cpu_runtime += delta_exec;
+		epoch = rq->cpu_epoch;
+	}
+
+	/*
+	 * If this task hasn't hit task_cache_work() for a while, invalidate
+	 * it's preferred state.
+	 */
+	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
+		mm->mm_sched_cpu = -1;
+		pcpu_sched->occ = -1;
+	}
+}
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	struct mm_struct *mm = p->mm;
+
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	if (mm->mm_sched_epoch == rq->cpu_epoch)
+		return;
+
+	guard(raw_spinlock)(&mm->mm_sched_lock);
+
+	if (mm->mm_sched_epoch == rq->cpu_epoch)
+		return;
+
+	if (work->next == work) {
+		task_work_add(p, work, TWA_RESUME);
+		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
+	}
+}
+
+static void task_cache_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	unsigned long m_a_occ = 0;
+	int cpu, m_a_cpu = -1;
+	cpumask_var_t cpus;
+
+	WARN_ON_ONCE(work != &p->cache_work);
+
+	work->next = work;
+
+	if (p->flags & PF_EXITING)
+		return;
+
+	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+		return;
+
+	scoped_guard (cpus_read_lock) {
+		cpumask_copy(cpus, cpu_online_mask);
+
+		for_each_cpu(cpu, cpus) {
+			/* XXX sched_cluster_active */
+			struct sched_domain *sd = per_cpu(sd_llc, cpu);
+			unsigned long occ, m_occ = 0, a_occ = 0;
+			int m_cpu = -1, nr = 0, i;
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				occ = fraction_mm_sched(cpu_rq(i),
+							per_cpu_ptr(mm->pcpu_sched, i));
+				a_occ += occ;
+				if (occ > m_occ) {
+					m_occ = occ;
+					m_cpu = i;
+				}
+				nr++;
+				trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
+					     per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
+			}
+
+			a_occ /= nr;
+			if (a_occ > m_a_occ) {
+				m_a_occ = a_occ;
+				m_a_cpu = m_cpu;
+			}
+
+			trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
+				     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				/* XXX threshold ? */
+				per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
+			}
+
+			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+		}
+	}
+
+	/*
+	 * If the max average cache occupancy is 'small' we don't care.
+	 */
+	if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
+		m_a_cpu = -1;
+
+	mm->mm_sched_cpu = m_a_cpu;
+
+	free_cpumask_var(cpus);
+}
+
+void init_sched_mm(struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	init_task_work(work, task_cache_work);
+	work->next = work;
+}
+
+#else
+
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
+				    s64 delta_exec) { }
+
+
+void init_sched_mm(struct task_struct *p) { }
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
+
+#endif
+
+static inline
+void update_curr_task(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
 	trace_sched_stat_runtime(p, delta_exec);
 	account_group_exec_runtime(p, delta_exec);
+	account_mm_sched(rq, p, delta_exec);
 	cgroup_account_cputime(p, delta_exec);
 }
 
@@ -1215,7 +1434,7 @@ s64 update_curr_common(struct rq *rq)
 
 	delta_exec = update_curr_se(rq, &donor->se);
 	if (likely(delta_exec > 0))
-		update_curr_task(donor, delta_exec);
+		update_curr_task(rq, donor, delta_exec);
 
 	return delta_exec;
 }
@@ -1244,7 +1463,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (entity_is_task(curr)) {
 		struct task_struct *p = task_of(curr);
 
-		update_curr_task(p, delta_exec);
+		update_curr_task(rq, p, delta_exec);
 
 		/*
 		 * If the fair_server is active, we need to account for the
@@ -7848,7 +8067,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * per-cpu select_rq_mask usage
 	 */
 	lockdep_assert_irqs_disabled();
-
+again:
 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
@@ -7886,7 +8105,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
 	p->recent_used_cpu = prev;
-	if (recent_used_cpu != prev &&
+	if (prev == p->wake_cpu &&
+	    recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
@@ -7939,6 +8159,18 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	if (prev != p->wake_cpu && !cpus_share_cache(prev, p->wake_cpu)) {
+		/*
+		 * Most likely select_cache_cpu() will have re-directed
+		 * the wakeup, but getting here means the preferred cache is
+		 * too busy, so re-try with the actual previous.
+		 *
+		 * XXX wake_affine is lost for this pass.
+		 */
+		prev = target = p->wake_cpu;
+		goto again;
+	}
+
 	/*
 	 * For cluster machines which have lower sharing cache like L2 or
 	 * LLC Tag, we tend to find an idle CPU in the target's cluster
@@ -8561,6 +8793,40 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return target;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
+
+static int select_cache_cpu(struct task_struct *p, int prev_cpu)
+{
+	struct mm_struct *mm = p->mm;
+	int cpu;
+
+	if (!mm || p->nr_cpus_allowed == 1)
+		return prev_cpu;
+
+	cpu = mm->mm_sched_cpu;
+	if (cpu < 0)
+		return prev_cpu;
+
+
+	if (static_branch_likely(&sched_numa_balancing) &&
+	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
+		/*
+		 * XXX look for max occupancy inside prev_cpu's node
+		 */
+		return prev_cpu;
+	}
+
+	return cpu;
+}
+#else
+static int select_cache_cpu(struct task_struct *p, int prev_cpu)
+{
+	return prev_cpu;
+}
+#endif
+
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8586,6 +8852,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	 * required for stable ->cpus_allowed
 	 */
 	lockdep_assert_held(&p->pi_lock);
+	guard(rcu)();
+
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
@@ -8593,6 +8861,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		    cpumask_test_cpu(cpu, p->cpus_ptr))
 			return cpu;
 
+		new_cpu = prev_cpu = select_cache_cpu(p, prev_cpu);
+
 		if (!is_rd_overutilized(this_rq()->rd)) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
@@ -8603,7 +8873,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		/*
 		 * If both 'cpu' and 'prev_cpu' are part of this domain,
@@ -8636,7 +8905,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 	}
-	rcu_read_unlock();
 
 	return new_cpu;
 }
@@ -9286,6 +9554,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
+#ifdef CONFIG_SCHED_CACHE
+	if (p->mm && p->mm->pcpu_sched) {
+		/*
+		 * XXX things like Skylake have non-inclusive L3 and might not
+		 * like this L3 centric view. What to do about L2 stickyness ?
+		 */
+		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
+		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
+	}
+#endif
+
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
@@ -9297,27 +9576,25 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
  * Returns 0, if task migration is not affected by locality.
  * Returns a negative value, if task migration improves locality i.e migration preferred.
  */
-static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
 	unsigned long src_weight, dst_weight;
 	int src_nid, dst_nid, dist;
 
-	if (!static_branch_likely(&sched_numa_balancing))
-		return 0;
-
-	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+	if (!p->numa_faults)
 		return 0;
 
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
+	src_nid = cpu_to_node(src_cpu);
+	dst_nid = cpu_to_node(dst_cpu);
 
 	if (src_nid == dst_nid)
 		return 0;
 
 	/* Migrating away from the preferred node is always bad. */
 	if (src_nid == p->numa_preferred_nid) {
-		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+		struct rq *src_rq = cpu_rq(src_cpu);
+		if (src_rq->nr_running > src_rq->nr_preferred_running)
 			return 1;
 		else
 			return 0;
@@ -9328,7 +9605,7 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 		return -1;
 
 	/* Leaving a core idle is often worse than degrading locality. */
-	if (env->idle == CPU_IDLE)
+	if (idle)
 		return 0;
 
 	dist = node_distance(src_nid, dst_nid);
@@ -9343,7 +9620,24 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	return src_weight - dst_weight;
 }
 
+static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	if (!static_branch_likely(&sched_numa_balancing))
+		return 0;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return 0;
+
+	return __migrate_degrades_locality(p, env->src_cpu, env->dst_cpu,
+					   env->idle == CPU_IDLE);
+}
+
 #else
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
+{
+	return 0;
+}
+
 static inline long migrate_degrades_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
@@ -13102,8 +13396,8 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
  */
 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct cfs_rq *cfs_rq;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -13113,6 +13407,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	task_tick_cache(rq, curr);
+
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47972f34ea70..d16ccd66ca07 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1171,6 +1171,12 @@ struct rq {
 	u64			clock_pelt_idle_copy;
 	u64			clock_idle_copy;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	raw_spinlock_t		cpu_epoch_lock;
+	u64			cpu_runtime;
+	unsigned long		cpu_epoch;
+	unsigned long		cpu_epoch_next;
+#endif
 
 	atomic_t		nr_iowait;
 
@@ -3861,6 +3867,8 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
 static inline void init_sched_mm_cid(struct task_struct *t) { }
 #endif /* !CONFIG_SCHED_MM_CID */
 
+extern void init_sched_mm(struct task_struct *p);
+
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 #ifdef CONFIG_SMP
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
  2025-06-18 18:27 ` [RFC patch v3 01/20] sched: Cache aware load-balancing Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-07-03 19:33   ` Shrikanth Hegde
  2025-07-08  1:15   ` Libo Chen
  2025-06-18 18:27 ` [RFC patch v3 03/20] sched: Avoid task migration within its preferred LLC Tim Chen
                   ` (21 subsequent siblings)
  23 siblings, 2 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

1. Fix compile error on percpu allocation.
2. Enqueue to the target CPU rather than the current CPU.
3. NULL LLC sched domain check(Libo Chen).
4. Introduce sched feature SCHED_CACHE to control cache aware scheduling
5. Fix unsigned occupancy initialization to -1.
6. If there is only 1 thread in the process, no need to enable cache
   awareness
7. Add __maybe_unused to __migrate_degrades_locality() to
   avoid compile warnings.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/mm_types.h |  4 ++--
 kernel/sched/fair.c      | 27 ++++++++++++++++-----------
 kernel/sched/features.h  |  1 +
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 013291c6aaa2..9de4a0a13c4d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1411,11 +1411,11 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 #endif /* CONFIG_SCHED_MM_CID */
 
 #ifdef CONFIG_SCHED_CACHE
-extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sched);
+extern void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
 
 static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 {
-	struct mm_sched *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
 	if (!pcpu_sched)
 		return -ENOMEM;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df7d4a324fbe..89db97f8ef02 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1175,7 +1175,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 #define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
 #define EPOCH_OLD	5		/* 50 ms */
 
-void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 {
 	unsigned long epoch;
 	int i;
@@ -1186,7 +1186,7 @@ void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
 
 		pcpu_sched->runtime = 0;
 		pcpu_sched->epoch = epoch = rq->cpu_epoch;
-		pcpu_sched->occ = -1;
+		pcpu_sched->occ = 0;
 	}
 
 	raw_spin_lock_init(&mm->mm_sched_lock);
@@ -1254,7 +1254,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	if (!mm || !mm->pcpu_sched)
 		return;
 
-	pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
+	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
 
 	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
 		__update_mm_sched(rq, pcpu_sched);
@@ -1264,12 +1264,14 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	}
 
 	/*
-	 * If this task hasn't hit task_cache_work() for a while, invalidate
+	 * If this task hasn't hit task_cache_work() for a while, or it
+	 * has only 1 thread, invalidate
 	 * it's preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
+	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD ||
+	    get_nr_threads(p) <= 1) {
 		mm->mm_sched_cpu = -1;
-		pcpu_sched->occ = -1;
+		pcpu_sched->occ = 0;
 	}
 }
 
@@ -1286,9 +1288,6 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
 
 	guard(raw_spinlock)(&mm->mm_sched_lock);
 
-	if (mm->mm_sched_epoch == rq->cpu_epoch)
-		return;
-
 	if (work->next == work) {
 		task_work_add(p, work, TWA_RESUME);
 		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
@@ -1322,6 +1321,9 @@ static void task_cache_work(struct callback_head *work)
 			unsigned long occ, m_occ = 0, a_occ = 0;
 			int m_cpu = -1, nr = 0, i;
 
+			if (!sd)
+				continue;
+
 			for_each_cpu(i, sched_domain_span(sd)) {
 				occ = fraction_mm_sched(cpu_rq(i),
 							per_cpu_ptr(mm->pcpu_sched, i));
@@ -8801,6 +8803,9 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	struct mm_struct *mm = p->mm;
 	int cpu;
 
+	if (!sched_feat(SCHED_CACHE))
+		return prev_cpu;
+
 	if (!mm || p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
@@ -9555,7 +9560,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 #ifdef CONFIG_SCHED_CACHE
-	if (p->mm && p->mm->pcpu_sched) {
+	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
 		/*
 		 * XXX things like Skylake have non-inclusive L3 and might not
 		 * like this L3 centric view. What to do about L2 stickyness ?
@@ -9633,7 +9638,7 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 }
 
 #else
-static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
+static __maybe_unused long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
 {
 	return 0;
 }
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..d2af7bfd36bf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -87,6 +87,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_UTIL, true)
 
+SCHED_FEAT(SCHED_CACHE, true)
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 03/20] sched: Avoid task migration within its preferred LLC
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
  2025-06-18 18:27 ` [RFC patch v3 01/20] sched: Cache aware load-balancing Tim Chen
  2025-06-18 18:27 ` [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-06-18 18:27 ` [RFC patch v3 04/20] sched: Avoid calculating the cpumask if the system is overloaded Tim Chen
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

It was found that when running schbench, there is a
significant amount of in-LLC task migrations, even if
the wakee is woken up on its preferred LLC. This
leads to core-to-core latency and impairs performance.

Inhibit task migration if the wakee is already in its
preferred LLC.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89db97f8ef02..567ad2a0cfa2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8813,6 +8813,8 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpu < 0)
 		return prev_cpu;
 
+	if (cpus_share_cache(cpu, prev_cpu))
+		return prev_cpu;
 
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 04/20] sched: Avoid calculating the cpumask if the system is overloaded
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (2 preceding siblings ...)
  2025-06-18 18:27 ` [RFC patch v3 03/20] sched: Avoid task migration within its preferred LLC Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-07-03 19:39   ` Shrikanth Hegde
  2025-06-18 18:27 ` [RFC patch v3 05/20] sched: Add hysteresis to switch a task's preferred LLC Tim Chen
                   ` (19 subsequent siblings)
  23 siblings, 1 reply; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu

From: K Prateek Nayak <kprateek.nayak@amd.com>

If the SIS_UTIL cuts off idle cpu search, result of the cpumask_and() is
of no use. Since select_idle_cpu() can now be called twice per wake up
in the select_idle_sibling() due to cache aware wake up, this overhead
can be visible in benchmarks like hackbench.

To save some additional cycles, especially in cases where we target
the LLC frequently and the search bails out because the LLC is busy,
only calculate the cpumask if the system is not overloaded.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 567ad2a0cfa2..6a2678f9d44a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7918,8 +7918,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	struct sched_domain_shared *sd_share;
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
 	if (sched_feat(SIS_UTIL)) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
@@ -7931,6 +7929,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
 	if (static_branch_unlikely(&sched_cluster_active)) {
 		struct sched_group *sg = sd->groups;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 05/20] sched: Add hysteresis to switch a task's preferred LLC
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (3 preceding siblings ...)
  2025-06-18 18:27 ` [RFC patch v3 04/20] sched: Avoid calculating the cpumask if the system is overloaded Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-07-02  6:47   ` Madadi Vineeth Reddy
  2025-06-18 18:27 ` [RFC patch v3 06/20] sched: Save the per LLC utilization for better cache aware scheduling Tim Chen
                   ` (18 subsequent siblings)
  23 siblings, 1 reply; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

Switching a process's preferred LLC generates lots of task
migrations across LLCs. To avoid frequent switches
of home LLC, implement the following policy:

1. Require a 2x occ change threshold to switch preferred LLC
2. Don't discard preferred LLC for a task

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a2678f9d44a..7fb2322c5d9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1175,6 +1175,14 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 #define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
 #define EPOCH_OLD	5		/* 50 ms */
 
+static int llc_id(int cpu)
+{
+	if (cpu < 0)
+		return -1;
+
+	return per_cpu(sd_llc_id, cpu);
+}
+
 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 {
 	unsigned long epoch;
@@ -1299,6 +1307,7 @@ static void task_cache_work(struct callback_head *work)
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
+	unsigned long last_m_a_occ = 0;
 	int cpu, m_a_cpu = -1;
 	cpumask_var_t cpus;
 
@@ -1337,11 +1346,13 @@ static void task_cache_work(struct callback_head *work)
 					     per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
 			}
 
-			a_occ /= nr;
+			// a_occ /= nr;
 			if (a_occ > m_a_occ) {
 				m_a_occ = a_occ;
 				m_a_cpu = m_cpu;
 			}
+			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
+				last_m_a_occ = a_occ;
 
 			trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
 				     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
@@ -1355,13 +1366,10 @@ static void task_cache_work(struct callback_head *work)
 		}
 	}
 
-	/*
-	 * If the max average cache occupancy is 'small' we don't care.
-	 */
-	if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
-		m_a_cpu = -1;
-
-	mm->mm_sched_cpu = m_a_cpu;
+	if (m_a_occ > (2 * last_m_a_occ)) {
+		/* avoid the bouncing of mm_sched_cpu */
+		mm->mm_sched_cpu = m_a_cpu;
+	}
 
 	free_cpumask_var(cpus);
 }
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 06/20] sched: Save the per LLC utilization for better cache aware scheduling
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (4 preceding siblings ...)
  2025-06-18 18:27 ` [RFC patch v3 05/20] sched: Add hysteresis to switch a task's preferred LLC Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-06-18 18:27 ` [RFC patch v3 07/20] sched: Add helper function to decide whether to allow " Tim Chen
                   ` (17 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

When a system gets busy and a process's preferred LLC
is saturated by too many threads within this process, there are significant
in-LLC task migrations within its preferred LLC. This leads to migration
latency and degrades performance. Ideally, task aggregation should be
inhibited if the task's preferred LLC is overloaded. This implies that a
metric is needed to indicate whether the LLC is busy.

Store the per-LLC utilization calculated via periodic load
balancing. These statistics will be used in subsequent patches to
determine whether tasks should be aggregated to their preferred LLC.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched/topology.h |  3 ++
 kernel/sched/fair.c            | 53 ++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7b4301b7235f..b3115bc1cbc0 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -78,6 +78,9 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned long	util_avg;
+#endif
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7fb2322c5d9e..02f104414b9a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8806,6 +8806,22 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 #ifdef CONFIG_SCHED_CACHE
 static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
 
+/* expected to be protected by rcu_read_lock() */
+static bool get_llc_stats(int cpu, unsigned long *util,
+			  unsigned long *cap)
+{
+	struct sched_domain_shared *sd_share;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (!sd_share)
+		return false;
+
+	*util = READ_ONCE(sd_share->util_avg);
+	*cap = per_cpu(sd_llc_size, cpu) * SCHED_CAPACITY_SCALE;
+
+	return true;
+}
+
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
@@ -10646,6 +10662,42 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Save this sched group's statistic for later use:
+ * The task wakeup and load balance can make better
+ * decision based on these statistics.
+ */
+static void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs,
+			     struct sched_group *group)
+{
+	/* Find the sched domain that spans this group. */
+	struct sched_domain *sd = env->sd->child;
+	struct sched_domain_shared *sd_share;
+
+	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/* only care the sched domain that spans 1 LLC */
+	if (!sd || !(sd->flags & SD_SHARE_LLC) ||
+	    !sd->parent || (sd->parent->flags & SD_SHARE_LLC))
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
+				   cpumask_first(sched_group_span(group))));
+	if (!sd_share)
+		return;
+
+	if (likely(READ_ONCE(sd_share->util_avg) != sgs->group_util))
+		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
+}
+#else
+static inline void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs,
+				    struct sched_group *group)
+{
+}
+#endif
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10735,6 +10787,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
+	update_sg_if_llc(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (5 preceding siblings ...)
  2025-06-18 18:27 ` [RFC patch v3 06/20] sched: Save the per LLC utilization for better cache aware scheduling Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-07-08  0:41   ` Libo Chen
  2025-06-18 18:27 ` [RFC patch v3 08/20] sched: Set up LLC indexing Tim Chen
                   ` (16 subsequent siblings)
  23 siblings, 1 reply; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

Cache-aware scheduling is designed to aggregate threads into their
preferred LLC, either via the task wake up path or the load balancing
path. One side effect is that when the preferred LLC is saturated,
more threads will continue to be stacked on it, degrading the workload's
latency. A strategy is needed to prevent this aggregation from going too
far such that the preferred LLC is too overloaded.

Introduce helper function _get_migrate_hint() to implement the LLC
migration policy:

1) A task is aggregated to its preferred LLC if both source/dest LLC
   are not too busy (<50% utilization, tunable), or the preferred
   LLC will not be too out of balanced from the non preferred LLC
   (>20% utilization, tunable, close to imbalance_pct of the LLC
   domain).
2) Allow a task to be moved from the preferred LLC to the
   non-preferred one if the non-preferred LLC will not be too out
   of balanced from the preferred prompting an aggregation task
   migration later.  We are still experimenting with the aggregation
   and migration policy. Some other possibilities are policy based
   on LLC's load or average number of tasks running.  Those could
   be tried out by tweaking _get_migrate_hint().

The function _get_migrate_hint() returns migration suggestions for the upper-level
functions.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/debug.c |   4 ++
 kernel/sched/fair.c  | 110 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   5 ++
 3 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56ae54e0ce6a..7271ad1152af 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -532,6 +532,10 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif
 
+#ifdef CONFIG_SCHED_CACHE
+	debugfs_create_u32("llc_aggr_cap", 0644, debugfs_sched, &sysctl_llc_aggr_cap);
+	debugfs_create_u32("llc_aggr_imb", 0644, debugfs_sched, &sysctl_llc_aggr_imb);
+#endif
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f104414b9a..10ea408d0e40 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8804,7 +8804,39 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 }
 
 #ifdef CONFIG_SCHED_CACHE
-static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
+static long __migrate_degrades_locality(struct task_struct *p,
+					int src_cpu, int dst_cpu,
+					bool idle);
+__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
+__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
+
+/*
+ * The margin used when comparing LLC utilization with CPU capacity.
+ * Parameter sysctl_llc_aggr_cap determines the LLC load level where
+ * active LLC aggregation is done.
+ * Derived from fits_capacity().
+ *
+ * (default: ~50%)
+ */
+#define fits_llc_capacity(util, max)	\
+	((util) * 100 < (max) * sysctl_llc_aggr_cap)
+
+/*
+ * The margin used when comparing utilization.
+ * is 'util1' noticeably greater than 'util2'
+ * Derived from capacity_greater().
+ * Bias is in perentage.
+ */
+/* Allows dst util to be bigger than src util by up to bias percent */
+#define util_greater(util1, util2) \
+	((util1) * 100 > (util2) * (100 + sysctl_llc_aggr_imb))
+
+enum llc_mig_hint {
+	mig_allow = 0,
+	mig_ignore,
+	mig_forbid
+};
+
 
 /* expected to be protected by rcu_read_lock() */
 static bool get_llc_stats(int cpu, unsigned long *util,
@@ -8822,6 +8854,82 @@ static bool get_llc_stats(int cpu, unsigned long *util,
 	return true;
 }
 
+static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
+					   unsigned long tsk_util,
+					   bool to_pref)
+{
+	unsigned long src_util, dst_util, src_cap, dst_cap;
+
+	if (cpus_share_cache(src_cpu, dst_cpu))
+		return mig_allow;
+
+	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
+	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
+		return mig_allow;
+
+	if (!fits_llc_capacity(dst_util, dst_cap) &&
+	    !fits_llc_capacity(src_util, src_cap))
+		return mig_ignore;
+
+	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
+	dst_util = dst_util + tsk_util;
+	if (to_pref) {
+		/*
+		 * sysctl_llc_aggr_imb is the imbalance allowed between
+		 * preferred LLC and non-preferred LLC.
+		 * Don't migrate if we will get preferred LLC too
+		 * heavily loaded and if the dest is much busier
+		 * than the src, in which case migration will
+		 * increase the imbalance too much.
+		 */
+		if (!fits_llc_capacity(dst_util, dst_cap) &&
+		    util_greater(dst_util, src_util))
+			return mig_forbid;
+	} else {
+		/*
+		 * Don't migrate if we will leave preferred LLC
+		 * too idle, or if this migration leads to the
+		 * non-preferred LLC falls within sysctl_aggr_imb percent
+		 * of preferred LLC, leading to migration again
+		 * back to preferred LLC.
+		 */
+		if (fits_llc_capacity(src_util, src_cap) ||
+		    !util_greater(src_util, dst_util))
+			return mig_forbid;
+	}
+	return mig_allow;
+}
+
+/*
+ * Give suggestion when task p is migrated from src_cpu to dst_cpu.
+ */
+static __maybe_unused enum llc_mig_hint get_migrate_hint(int src_cpu, int dst_cpu,
+							 struct task_struct *p)
+{
+	struct mm_struct *mm;
+	int cpu;
+
+	if (cpus_share_cache(src_cpu, dst_cpu))
+		return mig_allow;
+
+	mm = p->mm;
+	if (!mm)
+		return mig_allow;
+
+	cpu = mm->mm_sched_cpu;
+	if (cpu < 0)
+		return mig_allow;
+
+	if (cpus_share_cache(dst_cpu, cpu))
+		return _get_migrate_hint(src_cpu, dst_cpu,
+					 task_util(p), true);
+	else if (cpus_share_cache(src_cpu, cpu))
+		return _get_migrate_hint(src_cpu, dst_cpu,
+					 task_util(p), false);
+	else
+		return mig_allow;
+}
+
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d16ccd66ca07..1c6fd45c7f62 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2818,6 +2818,11 @@ extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 
+#ifdef CONFIG_SCHED_CACHE
+extern unsigned int sysctl_llc_aggr_cap;
+extern unsigned int sysctl_llc_aggr_imb;
+#endif
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 08/20] sched: Set up LLC indexing
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (6 preceding siblings ...)
  2025-06-18 18:27 ` [RFC patch v3 07/20] sched: Add helper function to decide whether to allow " Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-07-03 19:44   ` Shrikanth Hegde
  2025-06-18 18:27 ` [RFC patch v3 09/20] sched: Introduce task preferred LLC field Tim Chen
                   ` (15 subsequent siblings)
  23 siblings, 1 reply; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

Prepare for indexing arrays that track in each run queue: the number
of tasks preferring current LLC and each of the other LLC.

The reason to introduce LLC index is because the per LLC-scope data
is needed to do cache aware load balancing. However, the native lld_id
is usually the first CPU of that LLC domain, which is not continuous,
which might waste the space if the per LLC-scope data is stored
in an array (in current implementation).

In the future, this LLC index could be removed after
the native llc_id is used as the key to search into xarray based
array.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/sched.h   |  3 +++
 kernel/sched/fair.c     | 12 ++++++++++++
 kernel/sched/sched.h    |  2 ++
 kernel/sched/topology.c | 29 +++++++++++++++++++++++++++++
 4 files changed, 46 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d0e4cda2b3cd..7ce95a32e9ff 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -810,6 +810,9 @@ struct kmap_ctrl {
 #endif
 };
 
+/* XXX need fix to not use magic number */
+#define MAX_LLC 64
+
 struct task_struct {
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10ea408d0e40..5549710d95cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1183,6 +1183,18 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+/*
+ * continous index.
+ * TBD: replace by xarray with key llc_id()
+ */
+static inline int llc_idx(int cpu)
+{
+	if (cpu < 0)
+		return -1;
+
+	return per_cpu(sd_llc_idx, cpu);
+}
+
 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 {
 	unsigned long epoch;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1c6fd45c7f62..74eb2f3615aa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2037,6 +2037,7 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(int, sd_llc_idx);
 DECLARE_PER_CPU(int, sd_share_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
@@ -2045,6 +2046,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 
 extern struct static_key_false sched_asym_cpucapacity;
 extern struct static_key_false sched_cluster_active;
+extern int max_llcs;
 
 static __always_inline bool sched_asym_cpucap_active(void)
 {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f1ebc60d967f..b7bb13045dd8 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -672,6 +672,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_llc_idx);
 DEFINE_PER_CPU(int, sd_share_id);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
@@ -681,6 +682,25 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
 
+int max_llcs = -1;
+
+static void update_llc_idx(int cpu)
+{
+#ifdef CONFIG_SCHED_CACHE
+	int idx = -1, llc_id = -1;
+
+	llc_id = per_cpu(sd_llc_id, cpu);
+	idx = per_cpu(sd_llc_idx, llc_id);
+
+	if (idx < 0) {
+		idx = max_llcs++;
+		BUG_ON(idx > MAX_LLC);
+		per_cpu(sd_llc_idx, llc_id) = idx;
+	}
+	per_cpu(sd_llc_idx, cpu) = idx;
+#endif
+}
+
 static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain_shared *sds = NULL;
@@ -699,6 +719,7 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
+	update_llc_idx(cpu);
 
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
 	if (sd)
@@ -2394,6 +2415,14 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	bool has_asym = false;
 	bool has_cluster = false;
 
+#ifdef CONFIG_SCHED_CACHE
+	if (max_llcs < 0) {
+		for_each_possible_cpu(i)
+			per_cpu(sd_llc_idx, i) = -1;
+		max_llcs = 0;
+	}
+#endif
+
 	if (WARN_ON(cpumask_empty(cpu_map)))
 		goto error;
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 09/20] sched: Introduce task preferred LLC field
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (7 preceding siblings ...)
  2025-06-18 18:27 ` [RFC patch v3 08/20] sched: Set up LLC indexing Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-06-18 18:27 ` [RFC patch v3 10/20] sched: Calculate the number of tasks that have LLC preference on a runqueue Tim Chen
                   ` (14 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

With cache aware scheduling enabled, each process is assigned
a preferred LLC id, which will be used to quickly identify
the LLC domain this thread prefers to run. This is similar to
numa_preferred_nid for NUMA balance.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/sched.h | 1 +
 init/init_task.c      | 3 +++
 kernel/sched/fair.c   | 7 +++++++
 3 files changed, 11 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7ce95a32e9ff..2f1cb7445733 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1404,6 +1404,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CACHE
 	struct callback_head		cache_work;
+	int				preferred_llc;
 #endif
 
 #ifdef CONFIG_RSEQ
diff --git a/init/init_task.c b/init/init_task.c
index e557f622bd90..5fffbe766f57 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -188,6 +188,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.numa_group	= NULL,
 	.numa_faults	= NULL,
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	.preferred_llc  = -1,
+#endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	.kasan_depth	= 1,
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5549710d95cf..cc804a8c7061 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1267,6 +1267,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	struct mm_struct *mm = p->mm;
 	struct mm_sched *pcpu_sched;
 	unsigned long epoch;
+	int mm_sched_llc = -1;
 
 	/*
 	 * init_task and kthreads don't be having no mm
@@ -1293,6 +1294,12 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 		mm->mm_sched_cpu = -1;
 		pcpu_sched->occ = 0;
 	}
+
+	if (mm->mm_sched_cpu != -1)
+		mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);
+
+	if (p->preferred_llc != mm_sched_llc)
+		p->preferred_llc = mm_sched_llc;
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 10/20] sched: Calculate the number of tasks that have LLC preference on a runqueue
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (8 preceding siblings ...)
  2025-06-18 18:27 ` [RFC patch v3 09/20] sched: Introduce task preferred LLC field Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-07-03 19:45   ` Shrikanth Hegde
  2025-06-18 18:27 ` [RFC patch v3 11/20] sched: Introduce per runqueue task LLC preference counter Tim Chen
                   ` (13 subsequent siblings)
  23 siblings, 1 reply; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

Track for each run queue, the number of tasks that have a LLC preference
and how many of those tasks are running in its preferred LLC.  This is
similar to nr_numa_running and nr_preferred_running for NUMA balance,
and will be used by the cache-aware load balancing in subsequent patches.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/core.c  | 12 ++++++++++++
 kernel/sched/fair.c  | 42 +++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  7 +++++++
 3 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d9c3e75f79d1..34056eb79ef2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -498,6 +498,18 @@ void __trace_set_current_state(int state_value)
 }
 EXPORT_SYMBOL(__trace_set_current_state);
 
+#ifdef CONFIG_SMP
+int task_llc(const struct task_struct *p)
+{
+	return per_cpu(sd_llc_id, task_cpu(p));
+}
+#else
+int task_llc(const struct task_struct *p)
+{
+	return 0;
+}
+#endif
+
 /*
  * Serialization rules:
  *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc804a8c7061..88ff47194faa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1195,6 +1195,18 @@ static inline int llc_idx(int cpu)
 	return per_cpu(sd_llc_idx, cpu);
 }
 
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_llc_running += (p->preferred_llc != -1);
+	rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));
+}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
+{
+	rq->nr_llc_running -= (p->preferred_llc != -1);
+	rq->nr_pref_llc_running -= (p->preferred_llc == task_llc(p));
+}
+
 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 {
 	unsigned long epoch;
@@ -1298,8 +1310,11 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	if (mm->mm_sched_cpu != -1)
 		mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);
 
-	if (p->preferred_llc != mm_sched_llc)
+	if (p->preferred_llc != mm_sched_llc) {
+		account_llc_dequeue(rq, p);
 		p->preferred_llc = mm_sched_llc;
+		account_llc_enqueue(rq, p);
+	}
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
@@ -1400,6 +1415,14 @@ void init_sched_mm(struct task_struct *p)
 	work->next = work;
 }
 
+void reset_llc_stats(struct rq *rq)
+{
+	if (rq->nr_llc_running)
+		rq->nr_llc_running = 0;
+
+	rq->nr_pref_llc_running = 0;
+}
+
 #else
 
 static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
@@ -1410,6 +1433,17 @@ void init_sched_mm(struct task_struct *p) { }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
 
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
+
+void reset_llc_stats(struct rq *rq)
+{
+}
 #endif
 
 static inline
@@ -3939,6 +3973,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		struct rq *rq = rq_of(cfs_rq);
 
 		account_numa_enqueue(rq, task_of(se));
+		account_llc_enqueue(rq, task_of(se));
 		list_add(&se->group_node, &rq->cfs_tasks);
 	}
 #endif
@@ -3952,10 +3987,15 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #ifdef CONFIG_SMP
 	if (entity_is_task(se)) {
 		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
 	}
 #endif
 	cfs_rq->nr_queued--;
+
+	/* safeguard? */
+	if (!parent_entity(se) && !cfs_rq->nr_queued)
+		reset_llc_stats(rq_of(cfs_rq));
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74eb2f3615aa..6c83a71ac8ca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1104,6 +1104,10 @@ struct rq {
 	unsigned int		nr_preferred_running;
 	unsigned int		numa_migrate_on;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int		nr_pref_llc_running;
+	unsigned int		nr_llc_running;
+#endif
 #ifdef CONFIG_NO_HZ_COMMON
 #ifdef CONFIG_SMP
 	unsigned long		last_blocked_load_update_tick;
@@ -1948,6 +1952,9 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 
 #endif /* !CONFIG_NUMA_BALANCING */
 
+extern void reset_llc_stats(struct rq *rq);
+extern int task_llc(const struct task_struct *p);
+
 #ifdef CONFIG_SMP
 
 static inline void
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 11/20] sched: Introduce per runqueue task LLC preference counter
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (9 preceding siblings ...)
  2025-06-18 18:27 ` [RFC patch v3 10/20] sched: Calculate the number of tasks that have LLC preference on a runqueue Tim Chen
@ 2025-06-18 18:27 ` Tim Chen
  2025-06-18 18:28 ` [RFC patch v3 12/20] sched: Calculate the total number of preferred LLC tasks during load balance Tim Chen
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:27 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

Each runqueue is assigned a static array, where each element indicates
the number of tasks preferring a particular LLC mapped to the
array index.

For example, rq->nr_pref_llc[3] = 2 signifies that there are 2 tasks on
this runqueue which prefer to run within LLC3 (indexed from 0 to MAX_LLC
across the entire system). With this information, the load balancer can
make better decisions to select the busiest runqueue and migrate tasks
to their preferred LLC domains.

Note: The static array could be converted to an xarray in the future.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c  | 36 +++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 88ff47194faa..ba62b445bbbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1195,16 +1195,45 @@ static inline int llc_idx(int cpu)
 	return per_cpu(sd_llc_idx, cpu);
 }
 
+static inline int pref_llc_idx(struct task_struct *p)
+{
+	return llc_idx(p->preferred_llc);
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
+	int pref_llc;
+
 	rq->nr_llc_running += (p->preferred_llc != -1);
 	rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));
+
+	if (p->preferred_llc < 0)
+		return;
+
+	pref_llc = pref_llc_idx(p);
+	if (pref_llc < 0)
+		return;
+
+	++rq->nr_pref_llc[pref_llc];
 }
 
 static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 {
+	int pref_llc;
+
 	rq->nr_llc_running -= (p->preferred_llc != -1);
 	rq->nr_pref_llc_running -= (p->preferred_llc == task_llc(p));
+
+	if (p->preferred_llc < 0)
+		return;
+
+	pref_llc = pref_llc_idx(p);
+	if (pref_llc < 0)
+		return;
+
+	/* avoid negative counter */
+	if (rq->nr_pref_llc[pref_llc] > 0)
+		--rq->nr_pref_llc[pref_llc];
 }
 
 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
@@ -1417,8 +1446,13 @@ void init_sched_mm(struct task_struct *p)
 
 void reset_llc_stats(struct rq *rq)
 {
-	if (rq->nr_llc_running)
+	int i;
+
+	if (rq->nr_llc_running) {
+		for (i = 0; i < MAX_LLC; ++i)
+			rq->nr_pref_llc[i] = 0;
 		rq->nr_llc_running = 0;
+	}
 
 	rq->nr_pref_llc_running = 0;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6c83a71ac8ca..391ddc0195f8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1107,6 +1107,7 @@ struct rq {
 #ifdef CONFIG_SCHED_CACHE
 	unsigned int		nr_pref_llc_running;
 	unsigned int		nr_llc_running;
+	unsigned int		nr_pref_llc[MAX_LLC];
 #endif
 #ifdef CONFIG_NO_HZ_COMMON
 #ifdef CONFIG_SMP
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 12/20] sched: Calculate the total number of preferred LLC tasks during load balance
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (10 preceding siblings ...)
  2025-06-18 18:27 ` [RFC patch v3 11/20] sched: Introduce per runqueue task LLC preference counter Tim Chen
@ 2025-06-18 18:28 ` Tim Chen
  2025-06-18 18:28 ` [RFC patch v3 13/20] sched: Tag the sched group as llc_balance if it has tasks prefer other LLC Tim Chen
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

During load balancing between LLCs, gather the number of tasks
on each runqueue of a source LLC.

For example, consider a system with 4 sched groups LLC0, LLC1,
..., LLC3. We are balancing towards LLC3 and LLC0 has 3 tasks
preferring LLC3, LLC1 has 2 tasks preferring LLC3 and LLC2 has
1 task preferring LLC3. LLC0 with most tasks preferring LLC3
will be chosen as the busiest LLC to pick the tasks from.

The number of tasks preferring the destination LLC are gathered
from each run queue for a source LLC.

For example, consider the sched_group LLC0 with two CPUs, CPU0
and CPU1. On CPU0, 2 tasks prefer to run on LLC3, and on CPU1,
one task prefers LLC3. The total number of tasks preferring
LLC3 in LLC0 is 2 + 1 = 3.

These statistics enable the load balancer to select tasks from
a sched_group that best aligns tasks with their preferred LLCs.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba62b445bbbb..99f3cee7b276 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10459,6 +10459,9 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int nr_pref_llc[MAX_LLC];
+#endif
 };

 /*
@@ -10937,6 +10940,14 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (cpu_overutilized(i))
 			*sg_overutilized = 1;

+#ifdef CONFIG_SCHED_CACHE
+		if (sched_feat(SCHED_CACHE)) {
+			int j;
+
+			for (j = 0; j < max_llcs; ++j)
+				sgs->nr_pref_llc[j] += rq->nr_pref_llc[j];
+		}
+#endif
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-- 
2.32.0

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 13/20] sched: Tag the sched group as llc_balance if it has tasks prefer other LLC
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (11 preceding siblings ...)
  2025-06-18 18:28 ` [RFC patch v3 12/20] sched: Calculate the total number of preferred LLC tasks during load balance Tim Chen
@ 2025-06-18 18:28 ` Tim Chen
  2025-06-18 18:28 ` [RFC patch v3 14/20] sched: Introduce update_llc_busiest() to deal with groups having preferred LLC tasks Tim Chen
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

During load balancing between LLCs, check whether there are tasks
preferring the destination LLC. If so, balance those tasks to the
destination LLC first.

Tag the sched_group that has tasks preferring to run on other LLCs
(non-local) with the group_llc_balance flag. This way, the load
balancer will later attempt to pull/push these tasks to their
preferred LLCs.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99f3cee7b276..48a090c6e885 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10454,6 +10454,7 @@ struct sg_lb_stats {
 	enum group_type group_type;
 	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
+	unsigned int group_llc_balance;		/* Tasks should be moved to preferred LLC */
 	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -10818,6 +10819,43 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 	return false;
 }
 
+/*
+ * Do LLC balance on sched group that contains LLC, and have tasks preferring
+ * to run on LLC in idle dst_cpu.
+ */
+#ifdef CONFIG_SCHED_CACHE
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	struct sched_domain *child = env->sd->child;
+	int llc;
+
+	if (!sched_feat(SCHED_CACHE))
+		return false;
+
+	if (env->sd->flags & SD_SHARE_LLC)
+		return false;
+
+	/* only care about task migration among LLCs */
+	if (child && !(child->flags & SD_SHARE_LLC))
+		return false;
+
+	llc = llc_idx(env->dst_cpu);
+	if (sgs->nr_pref_llc[llc] > 0 &&
+	    _get_migrate_hint(env->src_cpu, env->dst_cpu,
+			      0, true) == mig_allow)
+		return true;
+
+	return false;
+}
+#else
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	return false;
+}
+#endif
+
 static inline long sibling_imbalance(struct lb_env *env,
 				    struct sd_lb_stats *sds,
 				    struct sg_lb_stats *busiest,
@@ -11000,6 +11038,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	update_sg_if_llc(env, sgs, group);
+
+	/* Check for tasks in this group can be moved to their preferred LLC */
+	if (!local_group && llc_balance(env, sgs, group))
+		sgs->group_llc_balance = 1;
+
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 14/20] sched: Introduce update_llc_busiest() to deal with groups having preferred LLC tasks
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (12 preceding siblings ...)
  2025-06-18 18:28 ` [RFC patch v3 13/20] sched: Tag the sched group as llc_balance if it has tasks prefer other LLC Tim Chen
@ 2025-06-18 18:28 ` Tim Chen
  2025-07-03 19:52   ` Shrikanth Hegde
  2025-06-18 18:28 ` [RFC patch v3 15/20] sched: Introduce a new migration_type to track the preferred LLC load balance Tim Chen
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

The load balancer attempts to identify the busiest sched_group with
the highest load and migrates some tasks to a less busy sched_group
to distribute the load across different CPUs.

When cache-aware scheduling is enabled, the busiest sched_group is
defined as the one with the highest number of tasks preferring to run
on the destination LLC. If the busiest group has llc_balance tag,
the cache aware load balance will be launched.

Introduce the helper function update_llc_busiest() to identify
such sched group with most tasks preferring the destination LLC.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48a090c6e885..ab3d1239d6e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10848,12 +10848,36 @@ static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 
 	return false;
 }
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	int idx;
+
+	/* Only the candidate with llc_balance need to be taken care of */
+	if (!sgs->group_llc_balance)
+		return false;
+
+	/*
+	 * There are more tasks that want to run on dst_cpu's LLC.
+	 */
+	idx = llc_idx(env->dst_cpu);
+	return sgs->nr_pref_llc[idx] > busiest->nr_pref_llc[idx];
+}
 #else
 static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 			       struct sched_group *group)
 {
 	return false;
 }
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	return false;
+}
 #endif
 
 static inline long sibling_imbalance(struct lb_env *env,
@@ -11085,6 +11109,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	     sds->local_stat.group_type != group_has_spare))
 		return false;
 
+	/* deal with prefer LLC load balance, if failed, fall into normal load balance */
+	if (update_llc_busiest(env, busiest, sgs))
+		return true;
+
+	/* if there is already a busy group, skip the normal load balance */
+	if (busiest->group_llc_balance)
+		return false;
+
 	if (sgs->group_type > busiest->group_type)
 		return true;
 
@@ -11991,9 +12023,11 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
 	/*
 	 * Try to move all excess tasks to a sibling domain of the busiest
 	 * group's child domain.
+	 * Also do so if we can move some tasks that prefer the local LLC.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    sibling_imbalance(env, &sds, busiest, local) > 1)
+	    (busiest->group_llc_balance ||
+	    sibling_imbalance(env, &sds, busiest, local) > 1))
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 15/20] sched: Introduce a new migration_type to track the preferred LLC load balance
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (13 preceding siblings ...)
  2025-06-18 18:28 ` [RFC patch v3 14/20] sched: Introduce update_llc_busiest() to deal with groups having preferred LLC tasks Tim Chen
@ 2025-06-18 18:28 ` Tim Chen
  2025-06-18 18:28 ` [RFC patch v3 16/20] sched: Consider LLC locality for active balance Tim Chen
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

Introduce a new migration type named migrate_llc_task to facilitate
cache-aware load balancing.

After the busiest sched_group is identified as the one that needs
migration due to having most tasks preferring destination LLC, tag the
migration type as the newly introduced migrate_llc_task. During load
balancing, each runqueue within the busiest preferred-LLC sched_group
is checked, and the runqueue with the highest number of tasks preferring
to run on the destination CPU is chosen as the busiest runqueue.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab3d1239d6e4..42222364ad9c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9711,7 +9711,8 @@ enum migration_type {
 	migrate_load = 0,
 	migrate_util,
 	migrate_task,
-	migrate_misfit
+	migrate_misfit,
+	migrate_llc_task
 };
 
 #define LBF_ALL_PINNED	0x01
@@ -10143,6 +10144,15 @@ static int detach_tasks(struct lb_env *env)
 			env->imbalance -= util;
 			break;
 
+		case migrate_llc_task:
+			/*
+			 * Since can_migrate_task() succeed, when we reach here, it means that p
+			 * can be migrated even if dst_cpu is not p's preferred_llc, because there
+			 * are no idle cores for p to do in-llc load balance.
+			 */
+			env->imbalance--;
+			break;
+
 		case migrate_task:
 			env->imbalance--;
 			break;
@@ -11779,6 +11789,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}
 
+#ifdef CONFIG_SCHED_CACHE
+	if (busiest->group_llc_balance) {
+		/* Move a task that prefer local LLC */
+		env->migration_type = migrate_llc_task;
+		env->imbalance = 1;
+		return;
+	}
+#endif
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -12087,6 +12106,10 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 	struct rq *busiest = NULL, *rq;
 	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
 	unsigned int busiest_nr = 0;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int busiest_pref_llc = 0;
+	int dst_llc;
+#endif
 	int i;
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
@@ -12195,6 +12218,16 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 			}
 			break;
 
+		case migrate_llc_task:
+#ifdef CONFIG_SCHED_CACHE
+			dst_llc = llc_idx(env->dst_cpu);
+			if (!cpus_share_cache(env->dst_cpu, rq->cpu) &&
+			    busiest_pref_llc < rq->nr_pref_llc[dst_llc]) {
+				busiest_pref_llc = rq->nr_pref_llc[dst_llc];
+				busiest = rq;
+			}
+#endif
+			break;
 		case migrate_task:
 			if (busiest_nr < nr_running) {
 				busiest_nr = nr_running;
@@ -12377,6 +12410,8 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
 	case migrate_misfit:
 		__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
 		break;
+	case migrate_llc_task:
+		break;
 	}
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 16/20] sched: Consider LLC locality for active balance
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (14 preceding siblings ...)
  2025-06-18 18:28 ` [RFC patch v3 15/20] sched: Introduce a new migration_type to track the preferred LLC load balance Tim Chen
@ 2025-06-18 18:28 ` Tim Chen
  2025-06-18 18:28 ` [RFC patch v3 17/20] sched: Consider LLC preference when picking tasks from busiest queue Tim Chen
                   ` (7 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

If busiest run queue has only one task, active balance is enlisted
to actually move the task.  However, before moving the task,
we should consider whether we are moving the task from its preferred
LLC.

Don't move the single running task in a run queue to another LLC, if
we are moving it from its desired LLC, or moving it will cause too much
imbalance between the LLCs.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 51 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42222364ad9c..3a8f6fc52055 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12294,10 +12294,43 @@ imbalanced_active_balance(struct lb_env *env)
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+static inline bool
+break_llc_locality(struct lb_env *env)
+{
+	if (!sched_feat(SCHED_CACHE))
+		return 0;
+
+	if (cpus_share_cache(env->src_cpu, env->dst_cpu))
+		return 0;
+	/*
+	 * All tasks want to stay put. Move only if LLC is
+	 * heavily loaded or don't pull a task from its
+	 * preferred CPU if it is the only one running.
+	 */
+	if (env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable &&
+	    (env->src_rq->nr_running <= 1 ||
+	    _get_migrate_hint(env->src_cpu, env->dst_cpu,
+			      0, false) == mig_forbid))
+		return 1;
+
+	return 0;
+}
+#else
+static inline bool
+break_llc_locality(struct lb_env *env)
+{
+	return 0;
+}
+#endif
+
 static int need_active_balance(struct lb_env *env)
 {
 	struct sched_domain *sd = env->sd;
 
+	if (break_llc_locality(env))
+		return 0;
+
 	if (asym_active_balance(env))
 		return 1;
 
@@ -12317,7 +12350,8 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
-	if (env->migration_type == migrate_misfit)
+	if (env->migration_type == migrate_misfit ||
+	    env->migration_type == migrate_llc_task)
 		return 1;
 
 	return 0;
@@ -12762,9 +12796,20 @@ static int active_load_balance_cpu_stop(void *data)
 		goto out_unlock;
 
 	/* Is there any task to move? */
-	if (busiest_rq->nr_running <= 1)
-		goto out_unlock;
+	if (busiest_rq->nr_running <= 1) {
+#ifdef CONFIG_SCHED_CACHE
+		int llc = llc_idx(target_cpu);
 
+		if (!sched_feat(SCHED_CACHE))
+			goto out_unlock;
+
+		if (llc < 0)
+			goto out_unlock;
+		/* don't migrate if task does not prefer target */
+		if (busiest_rq->nr_pref_llc[llc] < 1)
+#endif
+			goto out_unlock;
+	}
 	/*
 	 * This condition is "impossible", if it occurs
 	 * we need to fix it. Originally reported by
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 17/20] sched: Consider LLC preference when picking tasks from busiest queue
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (15 preceding siblings ...)
  2025-06-18 18:28 ` [RFC patch v3 16/20] sched: Consider LLC locality for active balance Tim Chen
@ 2025-06-18 18:28 ` Tim Chen
  2025-06-18 18:28 ` [RFC patch v3 18/20] sched: Do not migrate task if it is moving out of its preferred LLC Tim Chen
                   ` (6 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

When picking tasks from busiest queue for load balance, we currently
do not consider LLC preference.

Order the task in the busiest queue such that we picked the tasks in the
following order:
	1. tasks that prefer dst cpu's LLC
	2. tasks that have no preference in LLC
	3. tasks that prefer LLC other than the ones they are on
	4. tasks that prefer the LLC that they are currently on

This will allow tasks better chances to wind up in its preferred LLC.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a8f6fc52055..c9db32c2df63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10056,6 +10056,68 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 	return NULL;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Prepare lists to detach tasks in the following order:
+ * 1. tasks that prefer dst cpu's LLC
+ * 2. tasks that have no preference in LLC
+ * 3. tasks that prefer LLC other than the ones they are on
+ * 4. tasks that prefer the LLC that they are currently on.
+ */
+static struct list_head
+*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks)
+{
+	struct task_struct *p;
+	LIST_HEAD(pref_old_llc);
+	LIST_HEAD(pref_new_llc);
+	LIST_HEAD(no_pref_llc);
+	LIST_HEAD(pref_other_llc);
+
+	if (!sched_feat(SCHED_CACHE))
+		return tasks;
+
+	if (cpus_share_cache(env->dst_cpu, env->src_cpu))
+		return tasks;
+
+	while (!list_empty(tasks)) {
+		p = list_last_entry(tasks, struct task_struct, se.group_node);
+
+		if (p->preferred_llc == llc_id(env->dst_cpu)) {
+			list_move(&p->se.group_node, &pref_new_llc);
+			continue;
+		}
+
+		if (p->preferred_llc == llc_id(env->src_cpu)) {
+			list_move(&p->se.group_node, &pref_old_llc);
+			continue;
+		}
+
+		if (p->preferred_llc == -1) {
+			list_move(&p->se.group_node, &no_pref_llc);
+			continue;
+		}
+
+		list_move(&p->se.group_node, &pref_other_llc);
+	}
+
+	/*
+	 * We detach tasks from list tail in detach tasks.  Put tasks
+	 * to be chosen first at end of list.
+	 */
+	list_splice(&pref_new_llc, tasks);
+	list_splice(&no_pref_llc, tasks);
+	list_splice(&pref_other_llc, tasks);
+	list_splice(&pref_old_llc, tasks);
+	return tasks;
+}
+#else
+static inline struct list_head
+*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks)
+{
+	return tasks;
+}
+#endif
+
 /*
  * detach_tasks() -- tries to detach up to imbalance load/util/tasks from
  * busiest_rq, as part of a balancing operation within domain "sd".
@@ -10064,7 +10126,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
  */
 static int detach_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->src_rq->cfs_tasks;
+	struct list_head *tasks;
 	unsigned long util, load;
 	struct task_struct *p;
 	int detached = 0;
@@ -10083,6 +10145,8 @@ static int detach_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
+	tasks = order_tasks_by_llc(env, &env->src_rq->cfs_tasks);
+
 	while (!list_empty(tasks)) {
 		/*
 		 * We don't want to steal all, otherwise we may be treated likewise,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 18/20] sched: Do not migrate task if it is moving out of its preferred LLC
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (16 preceding siblings ...)
  2025-06-18 18:28 ` [RFC patch v3 17/20] sched: Consider LLC preference when picking tasks from busiest queue Tim Chen
@ 2025-06-18 18:28 ` Tim Chen
  2025-06-18 18:28 ` [RFC patch v3 19/20] sched: Introduce SCHED_CACHE_LB to control cache aware load balance Tim Chen
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

In the final step of task migration during load balancing,
can_migrate_task() is used to determine whether a task can
be moved to the destination. If the task has an LLC preference,
consider this preference when moving it out of its preferred LLC.
With this check in place, there is no need to retain the task's
cache-hot CPU check in task_hot(); remove it accordingly.

Besides, add more checks in detach_tasks() to avoid choosing
tasks that prefer their current LLC.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9db32c2df63..e342524481ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9787,17 +9787,6 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
-#ifdef CONFIG_SCHED_CACHE
-	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
-		/*
-		 * XXX things like Skylake have non-inclusive L3 and might not
-		 * like this L3 centric view. What to do about L2 stickyness ?
-		 */
-		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
-		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
-	}
-#endif
-
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
@@ -9992,6 +9981,12 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (env->flags & LBF_ACTIVE_LB)
 		return 1;
 
+#ifdef CONFIG_SCHED_CACHE
+	if (sched_feat(SCHED_CACHE) &&
+	    get_migrate_hint(env->src_cpu, env->dst_cpu, p) == mig_forbid)
+		return 0;
+#endif
+
 	degrades = migrate_degrades_locality(p, env);
 	if (!degrades)
 		hot = task_hot(p, env);
@@ -10252,6 +10247,17 @@ static int detach_tasks(struct lb_env *env)
 		if (env->imbalance <= 0)
 			break;
 
+#ifdef CONFIG_SCHED_CACHE
+		/*
+		 * Don't detach more tasks if remaining tasks want to stay:
+		 * The tasks have already been sorted by order_tasks_by_llc(),
+		 * they are tasks that prefer the current LLC.
+		 */
+		if (sched_feat(SCHED_CACHE) && p->preferred_llc != -1 &&
+		    llc_id(env->src_cpu) == p->preferred_llc)
+			break;
+#endif
+
 		continue;
 next:
 		if (p->sched_task_hot)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 19/20] sched: Introduce SCHED_CACHE_LB to control cache aware load balance
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (17 preceding siblings ...)
  2025-06-18 18:28 ` [RFC patch v3 18/20] sched: Do not migrate task if it is moving out of its preferred LLC Tim Chen
@ 2025-06-18 18:28 ` Tim Chen
  2025-06-18 18:28 ` [RFC patch v3 20/20] sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake up Tim Chen
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

Introduce the SCHED_CACHE_LB sched feature to enable or disable
cache aware load balance in the schduler.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c     | 18 ++++++++++--------
 kernel/sched/features.h |  1 +
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e342524481ed..af742601f2d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9982,7 +9982,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		return 1;
 
 #ifdef CONFIG_SCHED_CACHE
-	if (sched_feat(SCHED_CACHE) &&
+	if (sched_feat(SCHED_CACHE) && sched_feat(SCHED_CACHE_LB) &&
 	    get_migrate_hint(env->src_cpu, env->dst_cpu, p) == mig_forbid)
 		return 0;
 #endif
@@ -10068,7 +10068,7 @@ static struct list_head
 	LIST_HEAD(no_pref_llc);
 	LIST_HEAD(pref_other_llc);
 
-	if (!sched_feat(SCHED_CACHE))
+	if (!sched_feat(SCHED_CACHE) || !sched_feat(SCHED_CACHE_LB))
 		return tasks;
 
 	if (cpus_share_cache(env->dst_cpu, env->src_cpu))
@@ -10253,7 +10253,8 @@ static int detach_tasks(struct lb_env *env)
 		 * The tasks have already been sorted by order_tasks_by_llc(),
 		 * they are tasks that prefer the current LLC.
 		 */
-		if (sched_feat(SCHED_CACHE) && p->preferred_llc != -1 &&
+		if (sched_feat(SCHED_CACHE) && sched_feat(SCHED_CACHE_LB) &&
+		    p->preferred_llc != -1 &&
 		    llc_id(env->src_cpu) == p->preferred_llc)
 			break;
 #endif
@@ -10910,7 +10911,7 @@ static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 	struct sched_domain *child = env->sd->child;
 	int llc;
 
-	if (!sched_feat(SCHED_CACHE))
+	if (!sched_feat(SCHED_CACHE) || !sched_feat(SCHED_CACHE_LB))
 		return false;
 
 	if (env->sd->flags & SD_SHARE_LLC)
@@ -11021,7 +11022,8 @@ static void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs,
 	struct sched_domain *sd = env->sd->child;
 	struct sched_domain_shared *sd_share;
 
-	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
+	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE ||
+	    !sched_feat(SCHED_CACHE_LB))
 		return;
 
 	/* only care the sched domain that spans 1 LLC */
@@ -11083,7 +11085,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			*sg_overutilized = 1;
 
 #ifdef CONFIG_SCHED_CACHE
-		if (sched_feat(SCHED_CACHE)) {
+		if (sched_feat(SCHED_CACHE) && sched_feat(SCHED_CACHE_LB)) {
 			int j;
 
 			for (j = 0; j < max_llcs; ++j)
@@ -12368,7 +12370,7 @@ imbalanced_active_balance(struct lb_env *env)
 static inline bool
 break_llc_locality(struct lb_env *env)
 {
-	if (!sched_feat(SCHED_CACHE))
+	if (!sched_feat(SCHED_CACHE) || !sched_feat(SCHED_CACHE_LB))
 		return 0;
 
 	if (cpus_share_cache(env->src_cpu, env->dst_cpu))
@@ -12870,7 +12872,7 @@ static int active_load_balance_cpu_stop(void *data)
 #ifdef CONFIG_SCHED_CACHE
 		int llc = llc_idx(target_cpu);
 
-		if (!sched_feat(SCHED_CACHE))
+		if (!sched_feat(SCHED_CACHE) || !sched_feat(SCHED_CACHE_LB))
 			goto out_unlock;
 
 		if (llc < 0)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d2af7bfd36bf..11dbd74cd365 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -88,6 +88,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
 SCHED_FEAT(SIS_UTIL, true)
 
 SCHED_FEAT(SCHED_CACHE, true)
+SCHED_FEAT(SCHED_CACHE_LB, true)
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [RFC patch v3 20/20] sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake up
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (18 preceding siblings ...)
  2025-06-18 18:28 ` [RFC patch v3 19/20] sched: Introduce SCHED_CACHE_LB to control cache aware load balance Tim Chen
@ 2025-06-18 18:28 ` Tim Chen
  2025-06-19  6:39 ` [RFC patch v3 00/20] Cache aware scheduling Yangyu Chen
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-18 18:28 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Chen Yu

Introduce SCHED_CACHE_WAKE feature to enable or disable cache-aware
wake up. Disable this feature by default because cache-aware wakeup
is overly aggressive in stacking wakees of the same process on the same LLC,
if they are frequently woken up.

The wake ups can be much more frequent than load balances, adding
much overhead when load balance alone for LLC aggregation is sufficient.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c     | 6 +++++-
 kernel/sched/features.h | 1 +
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index af742601f2d7..32c90fab0d63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9028,7 +9028,7 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	struct mm_struct *mm = p->mm;
 	int cpu;
 
-	if (!sched_feat(SCHED_CACHE))
+	if (!sched_feat(SCHED_CACHE) || !sched_feat(SCHED_CACHE_WAKE))
 		return prev_cpu;
 
 	if (!mm || p->nr_cpus_allowed == 1)
@@ -9041,6 +9041,10 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpus_share_cache(cpu, prev_cpu))
 		return prev_cpu;
 
+	if (_get_migrate_hint(prev_cpu, cpu,
+			      task_util(p), true) == mig_forbid)
+		return prev_cpu;
+
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
 		/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 11dbd74cd365..44b408cf0dd4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,6 +89,7 @@ SCHED_FEAT(SIS_UTIL, true)
 
 SCHED_FEAT(SCHED_CACHE, true)
 SCHED_FEAT(SCHED_CACHE_LB, true)
+SCHED_FEAT(SCHED_CACHE_WAKE, false)
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (19 preceding siblings ...)
  2025-06-18 18:28 ` [RFC patch v3 20/20] sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake up Tim Chen
@ 2025-06-19  6:39 ` Yangyu Chen
  2025-06-19 13:21   ` Chen, Yu C
  2025-06-20 19:25 ` Madadi Vineeth Reddy
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 68+ messages in thread
From: Yangyu Chen @ 2025-06-19  6:39 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu

Nice work!

I've tested your patch based on commit fb4d33ab452e and found it
incredibly helpful for Verilator with large RTL simulations like
XiangShan [1] on AMD EPYC Geona.

I've created a simple benchmark [2] using a static build of an
8-thread Verilator of XiangShan. Simply clone the repository and
run `make run`.

In a static allocated 8-CCX KVM (with a total of 128 vCPUs) on EPYC
9T24, before the patch, we have a simulation time of 49.348ms. This
was because each thread was distributed across every CCX, resulting
in extremely high core-to-core latency. However, after applying the
patch, the entire 8-thread Verilator is allocated to a single CCX.
Consequently, the simulation time was reduced to 24.196ms, which
is a remarkable 2.03x faster than before. We don't need numactl
anymore!

[1] https://github.com/OpenXiangShan/XiangShan
[2] https://github.com/cyyself/chacha20-xiangshan

Tested-by: Yangyu Chen <cyy@cyyself.name>

Thanks,
Yangyu Chen

On 19/6/2025 02:27, Tim Chen wrote:
> This is the third revision of the cache aware scheduling patches,
> based on the original patch proposed by Peter[1].
>  The goal of the patch series is to aggregate tasks sharing data
> to the same cache domain, thereby reducing cache bouncing and
> cache misses, and improve data access efficiency. In the current
> implementation, threads within the same process are considered
> as entities that potentially share resources.
>  In previous versions, aggregation of tasks were done in the
> wake up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
> 1) Aggregation of tasks during wake up led to load imbalance
>    between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake up tasks aggregation happened at a faster rate and
>    load balancing moved tasks in opposite directions, leading
>    to continuous and excessive task migrations and regressions
>    in benchmarks like schbench.
> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
> 1) Identify tasks that prefer to run on their hottest LLC and
>    move them there.
> 2) Prevent generic load balancing from moving a task out of
>    its hottest LLC.
> By default, LLC task aggregation during wake-up is disabled.
> Conversely, cache-aware load balancing is enabled by default.
> For easier comparison, two scheduler features are introduced:
> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> wake up and cache-aware load balancing, respectively. By default,
> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
> is only done on load balancing.
> With above default settings, task migrations occur less frequently
> and no longer happen in the latency-sensitive wake-up path.
> The load balancing and migration policy are now implemented in
> a single location within the function _get_migrate_hint().
> Debugfs knobs are also introduced to fine-tune the
> _get_migrate_hint() function. Please refer to patch 7 for
> detail.
> Improvements in performance for hackbench are observed in the
> lower load ranges when tested on a 2 socket sapphire rapids with
> 30 cores per socket. The DRAM interleaving is enabled in the
> BIOS so it essential has one NUMA node with two last level
> caches. Hackbench benefits from having all the threads
> in the process running in the same LLC. There are some small
> regressions for the heavily loaded case when not all threads can
> fit in a LLC.
> Hackbench is run with one process, and pairs of threads ping
> ponging message off each other via command with increasing number
> of thread pairs, each test runs for 10 cycles:
> hackbench -g 1 --thread --pipe(socket) -l 1000000 -s 100 -f <pairs>
> case                    load            baseline(std%)  compare%( std%)
> threads-pipe-8          1-groups         1.00 (  2.70)  +24.51 (  0.59)
> threads-pipe-15         1-groups         1.00 (  1.42)  +28.37 (  0.68)
> threads-pipe-30         1-groups         1.00 (  2.53)  +26.16 (  0.11)
> threads-pipe-45         1-groups         1.00 (  0.48)  +35.38 (  0.18)
> threads-pipe-60         1-groups         1.00 (  2.13)  +13.46 ( 12.81)
> threads-pipe-75         1-groups         1.00 (  1.57)  +16.71 (  0.20)
> threads-pipe-90         1-groups         1.00 (  0.22)   -0.57 (  1.21)
> threads-sockets-8       1-groups         1.00 (  2.82)  +23.04 (  0.83)
> threads-sockets-15      1-groups         1.00 (  2.57)  +21.67 (  1.90)
> threads-sockets-30      1-groups         1.00 (  0.75)  +18.78 (  0.09)
> threads-sockets-45      1-groups         1.00 (  1.63)  +18.89 (  0.43)
> threads-sockets-60      1-groups         1.00 (  0.66)  +10.10 (  1.91)
> threads-sockets-75      1-groups         1.00 (  0.44)  -14.49 (  0.43)
> threads-sockets-90      1-groups         1.00 (  0.15)   -8.03 (  3.88)
> Similar tests were also experimented on schbench on the system.
> Overall latency improvement is observed when underloaded and
> regression when overloaded. The regression is significantly
> smaller than the previous version because cache aware aggregation
> is in load balancing rather than in wake up path. Besides, it is
> found that the schbench seems to have large run-to-run variance,
> so the result of schbench might be only used as reference.
> schbench:
>                                    baseline              nowake_lb
> Lat 50.0th-qrtle-1          5.00 (   0.00%)        5.00 (   0.00%)
> Lat 90.0th-qrtle-1          9.00 (   0.00%)        8.00 (  11.11%)
> Lat 99.0th-qrtle-1         15.00 (   0.00%)       15.00 (   0.00%)
> Lat 99.9th-qrtle-1         32.00 (   0.00%)       23.00 (  28.12%)
> Lat 20.0th-qrtle-1        267.00 (   0.00%)      266.00 (   0.37%)
> Lat 50.0th-qrtle-2          8.00 (   0.00%)        4.00 (  50.00%)
> Lat 90.0th-qrtle-2          9.00 (   0.00%)        7.00 (  22.22%)
> Lat 99.0th-qrtle-2         18.00 (   0.00%)       11.00 (  38.89%)
> Lat 99.9th-qrtle-2         26.00 (   0.00%)       25.00 (   3.85%)
> Lat 20.0th-qrtle-2        535.00 (   0.00%)      537.00 (  -0.37%)
> Lat 50.0th-qrtle-4          6.00 (   0.00%)        4.00 (  33.33%)
> Lat 90.0th-qrtle-4          8.00 (   0.00%)        5.00 (  37.50%)
> Lat 99.0th-qrtle-4         13.00 (   0.00%)       10.00 (  23.08%)
> Lat 99.9th-qrtle-4         20.00 (   0.00%)       14.00 (  30.00%)
> Lat 20.0th-qrtle-4       1066.00 (   0.00%)     1050.00 (   1.50%)
> Lat 50.0th-qrtle-8          5.00 (   0.00%)        4.00 (  20.00%)
> Lat 90.0th-qrtle-8          7.00 (   0.00%)        5.00 (  28.57%)
> Lat 99.0th-qrtle-8         11.00 (   0.00%)        8.00 (  27.27%)
> Lat 99.9th-qrtle-8         17.00 (   0.00%)       18.00 (  -5.88%)
> Lat 20.0th-qrtle-8       2140.00 (   0.00%)     2156.00 (  -0.75%)
> Lat 50.0th-qrtle-16         6.00 (   0.00%)        4.00 (  33.33%)
> Lat 90.0th-qrtle-16         7.00 (   0.00%)        6.00 (  14.29%)
> Lat 99.0th-qrtle-16        11.00 (   0.00%)       11.00 (   0.00%)
> Lat 99.9th-qrtle-16        18.00 (   0.00%)       18.00 (   0.00%)
> Lat 20.0th-qrtle-16      4296.00 (   0.00%)     4216.00 (   1.86%)
> Lat 50.0th-qrtle-32         6.00 (   0.00%)        4.00 (  33.33%)
> Lat 90.0th-qrtle-32         7.00 (   0.00%)        5.00 (  28.57%)
> Lat 99.0th-qrtle-32        11.00 (   0.00%)        9.00 (  18.18%)
> Lat 99.9th-qrtle-32        17.00 (   0.00%)       14.00 (  17.65%)
> Lat 20.0th-qrtle-32      8496.00 (   0.00%)     8624.00 (  -1.51%)
> Lat 50.0th-qrtle-64         5.00 (   0.00%)        5.00 (   0.00%)
> Lat 90.0th-qrtle-64         7.00 (   0.00%)        7.00 (   0.00%)
> Lat 99.0th-qrtle-64        11.00 (   0.00%)       11.00 (   0.00%)
> Lat 99.9th-qrtle-64        17.00 (   0.00%)       18.00 (  -5.88%)
> Lat 20.0th-qrtle-64     17120.00 (   0.00%)    15728.00 (   8.13%)
> Lat 50.0th-qrtle-128        6.00 (   0.00%)        6.00 (   0.00%)
> Lat 90.0th-qrtle-128        9.00 (   0.00%)        8.00 (  11.11%)
> Lat 99.0th-qrtle-128       13.00 (   0.00%)       14.00 (  -7.69%)
> Lat 99.9th-qrtle-128       20.00 (   0.00%)       26.00 ( -30.00%)
> Lat 20.0th-qrtle-128    19488.00 (   0.00%)    18784.00 (   3.61%)
> Lat 50.0th-qrtle-239        8.00 (   0.00%)        8.00 (   0.00%)
> Lat 90.0th-qrtle-239       16.00 (   0.00%)       14.00 (  12.50%)
> Lat 99.0th-qrtle-239       45.00 (   0.00%)       41.00 (   8.89%)
> Lat 99.9th-qrtle-239      137.00 (   0.00%)      225.00 ( -64.23%)
> Lat 20.0th-qrtle-239    30432.00 (   0.00%)    29920.00 (   1.68%)
> AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
> Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
> with 1 group test scenario benefits from cache aware load balance
> too:
> hackbench(1 group and fd ranges in [1,6]:
> case                    load            baseline(std%)  compare%( std%)
> threads-pipe-1          1-groups         1.00 (  1.22)   +2.84 (  0.51)
> threads-pipe-2          1-groups         1.00 (  5.82)  +42.82 ( 43.61)
> threads-pipe-3          1-groups         1.00 (  3.49)  +17.33 ( 18.68)
> threads-pipe-4          1-groups         1.00 (  2.49)  +12.49 (  5.89)
> threads-pipe-5          1-groups         1.00 (  1.46)   +8.62 (  4.43)
> threads-pipe-6          1-groups         1.00 (  2.83)  +12.73 (  8.94)
> threads-sockets-1       1-groups         1.00 (  1.31)  +28.68 (  2.25)
> threads-sockets-2       1-groups         1.00 (  5.17)  +34.84 ( 36.90)
> threads-sockets-3       1-groups         1.00 (  1.57)   +9.15 (  5.52)
> threads-sockets-4       1-groups         1.00 (  1.99)  +16.51 (  6.04)
> threads-sockets-5       1-groups         1.00 (  2.39)  +10.88 (  2.17)
> threads-sockets-6       1-groups         1.00 (  1.62)   +7.22 (  2.00)
> Besides a single instance of hackbench, four instances of hackbench are
> also tested on Milan. The test results show that different instances of
> hackbench are aggregated to dedicated LLCs, and performance improvement
> is observed.
> schbench mmtests(unstable)
>                                   baseline              nowake_lb
> Lat 50.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
> Lat 90.0th-qrtle-1        12.00 (   0.00%)       10.00 (  16.67%)
> Lat 99.0th-qrtle-1        16.00 (   0.00%)       14.00 (  12.50%)
> Lat 99.9th-qrtle-1        22.00 (   0.00%)       21.00 (   4.55%)
> Lat 20.0th-qrtle-1       759.00 (   0.00%)      759.00 (   0.00%)
> Lat 50.0th-qrtle-2         9.00 (   0.00%)        7.00 (  22.22%)
> Lat 90.0th-qrtle-2        12.00 (   0.00%)       12.00 (   0.00%)
> Lat 99.0th-qrtle-2        16.00 (   0.00%)       15.00 (   6.25%)
> Lat 99.9th-qrtle-2        22.00 (   0.00%)       21.00 (   4.55%)
> Lat 20.0th-qrtle-2      1534.00 (   0.00%)     1510.00 (   1.56%)
> Lat 50.0th-qrtle-4         8.00 (   0.00%)        9.00 ( -12.50%)
> Lat 90.0th-qrtle-4        12.00 (   0.00%)       12.00 (   0.00%)
> Lat 99.0th-qrtle-4        15.00 (   0.00%)       16.00 (  -6.67%)
> Lat 99.9th-qrtle-4        21.00 (   0.00%)       23.00 (  -9.52%)
> Lat 20.0th-qrtle-4      3076.00 (   0.00%)     2860.00 (   7.02%)
> Lat 50.0th-qrtle-8        10.00 (   0.00%)        9.00 (  10.00%)
> Lat 90.0th-qrtle-8        12.00 (   0.00%)       13.00 (  -8.33%)
> Lat 99.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
> Lat 99.9th-qrtle-8        22.00 (   0.00%)       24.00 (  -9.09%)
> Lat 20.0th-qrtle-8      6232.00 (   0.00%)     5896.00 (   5.39%)
> Lat 50.0th-qrtle-16        9.00 (   0.00%)        9.00 (   0.00%)
> Lat 90.0th-qrtle-16       13.00 (   0.00%)       13.00 (   0.00%)
> Lat 99.0th-qrtle-16       17.00 (   0.00%)       18.00 (  -5.88%)
> Lat 99.9th-qrtle-16       23.00 (   0.00%)       26.00 ( -13.04%)
> Lat 20.0th-qrtle-16    10096.00 (   0.00%)    10352.00 (  -2.54%)
> Lat 50.0th-qrtle-32       15.00 (   0.00%)       15.00 (   0.00%)
> Lat 90.0th-qrtle-32       25.00 (   0.00%)       26.00 (  -4.00%)
> Lat 99.0th-qrtle-32       49.00 (   0.00%)       50.00 (  -2.04%)
> Lat 99.9th-qrtle-32      945.00 (   0.00%)     1005.00 (  -6.35%)
> Lat 20.0th-qrtle-32    11600.00 (   0.00%)    11632.00 (  -0.28%)
> Netperf/Tbench have not been tested yet. As they are single-process
> benchmarks that are not the target of this cache-aware scheduling.
> Additionally, client and server components should be tested on
> different machines or bound to different nodes. Otherwise,
> cache-aware scheduling might harm their performance: placing client
> and server in the same LLC could yield higher throughput due to
> improved cache locality in the TCP/IP stack, whereas cache-aware
> scheduling aims to place them in dedicated LLCs.
> This patch set is applied on v6.15 kernel.
>  There are some further work needed for future versions in this
> patch set.  We will need to align NUMA balancing with LLC aggregations
> such that LLC aggregation will align with the preferred NUMA node.
> Comments and tests are much appreciated.
> [1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> The patches are grouped as follow:
> Patch 1:     Peter's original patch.
> Patch 2-5:   Various fixes and tuning of the original v1 patch.
> Patch 6-12:  Infrastructure and helper functions for load balancing to be cache aware.
> Patch 13-18: Add logic to load balancing for preferred LLC aggregation.
> Patch 19:    Add process LLC aggregation in load balancing sched feature.
> Patch 20:    Add Process LLC aggregation in wake up sched feature (turn off by default).
> v1:
> https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> v2:
> https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
> Chen Yu (3):
>   sched: Several fixes for cache aware scheduling
>   sched: Avoid task migration within its preferred LLC
>   sched: Save the per LLC utilization for better cache aware scheduling
> K Prateek Nayak (1):
>   sched: Avoid calculating the cpumask if the system is overloaded
> Peter Zijlstra (1):
>   sched: Cache aware load-balancing
> Tim Chen (15):
>   sched: Add hysteresis to switch a task's preferred LLC
>   sched: Add helper function to decide whether to allow cache aware
>     scheduling
>   sched: Set up LLC indexing
>   sched: Introduce task preferred LLC field
>   sched: Calculate the number of tasks that have LLC preference on a
>     runqueue
>   sched: Introduce per runqueue task LLC preference counter
>   sched: Calculate the total number of preferred LLC tasks during load
>     balance
>   sched: Tag the sched group as llc_balance if it has tasks prefer other
>     LLC
>   sched: Introduce update_llc_busiest() to deal with groups having
>     preferred LLC tasks
>   sched: Introduce a new migration_type to track the preferred LLC load
>     balance
>   sched: Consider LLC locality for active balance
>   sched: Consider LLC preference when picking tasks from busiest queue
>   sched: Do not migrate task if it is moving out of its preferred LLC
>   sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>   sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>     up
>  include/linux/mm_types.h       |  44 ++
>  include/linux/sched.h          |   8 +
>  include/linux/sched/topology.h |   3 +
>  init/Kconfig                   |   4 +
>  init/init_task.c               |   3 +
>  kernel/fork.c                  |   5 +
>  kernel/sched/core.c            |  25 +-
>  kernel/sched/debug.c           |   4 +
>  kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>  kernel/sched/features.h        |   3 +
>  kernel/sched/sched.h           |  23 +
>  kernel/sched/topology.c        |  29 ++
>  12 files changed, 982 insertions(+), 28 deletions(-)



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-19  6:39 ` [RFC patch v3 00/20] Cache aware scheduling Yangyu Chen
@ 2025-06-19 13:21   ` Chen, Yu C
  2025-06-19 14:12     ` Yangyu Chen
  0 siblings, 1 reply; 68+ messages in thread
From: Chen, Yu C @ 2025-06-19 13:21 UTC (permalink / raw)
  To: Yangyu Chen, Tim Chen, Peter Zijlstra, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel

On 6/19/2025 2:39 PM, Yangyu Chen wrote:
> Nice work!
> 
> I've tested your patch based on commit fb4d33ab452e and found it
> incredibly helpful for Verilator with large RTL simulations like
> XiangShan [1] on AMD EPYC Geona.
> 
> I've created a simple benchmark [2] using a static build of an
> 8-thread Verilator of XiangShan. Simply clone the repository and
> run `make run`.
> 
> In a static allocated 8-CCX KVM (with a total of 128 vCPUs) on EPYC
> 9T24, before the patch, we have a simulation time of 49.348ms. This
> was because each thread was distributed across every CCX, resulting
> in extremely high core-to-core latency. However, after applying the
> patch, the entire 8-thread Verilator is allocated to a single CCX.
> Consequently, the simulation time was reduced to 24.196ms, which
> is a remarkable 2.03x faster than before. We don't need numactl
> anymore!
> 
> [1] https://github.com/OpenXiangShan/XiangShan
> [2] https://github.com/cyyself/chacha20-xiangshan
> 
> Tested-by: Yangyu Chen <cyy@cyyself.name>
> 

Thanks Yangyu for your test. May I know if these 8-threads have any
data sharing with each other, or each thread has their dedicated
data? Or, there is 1 main thread, the other 7 threads do the
chacha20 rotate and put the result to the main thread?
Anyway I tested it on a Xeon EMR with turbo-disabled and saw ~20%
reduction in the total time.

Thanks,
Chenyu


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-19 13:21   ` Chen, Yu C
@ 2025-06-19 14:12     ` Yangyu Chen
  0 siblings, 0 replies; 68+ messages in thread
From: Yangyu Chen @ 2025-06-19 14:12 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel



> On 19 Jun 2025, at 21:21, Chen, Yu C <yu.c.chen@intel.com> wrote:
> 
> On 6/19/2025 2:39 PM, Yangyu Chen wrote:
>> Nice work!
>> I've tested your patch based on commit fb4d33ab452e and found it
>> incredibly helpful for Verilator with large RTL simulations like
>> XiangShan [1] on AMD EPYC Geona.
>> I've created a simple benchmark [2] using a static build of an
>> 8-thread Verilator of XiangShan. Simply clone the repository and
>> run `make run`.
>> In a static allocated 8-CCX KVM (with a total of 128 vCPUs) on EPYC
>> 9T24, before the patch, we have a simulation time of 49.348ms. This
>> was because each thread was distributed across every CCX, resulting
>> in extremely high core-to-core latency. However, after applying the
>> patch, the entire 8-thread Verilator is allocated to a single CCX.
>> Consequently, the simulation time was reduced to 24.196ms, which
>> is a remarkable 2.03x faster than before. We don't need numactl
>> anymore!
>> [1] https://github.com/OpenXiangShan/XiangShan
>> [2] https://github.com/cyyself/chacha20-xiangshan
>> Tested-by: Yangyu Chen <cyy@cyyself.name>
> 
> Thanks Yangyu for your test. May I know if these 8-threads have any
> data sharing with each other, or each thread has their dedicated
> data? Or, there is 1 main thread, the other 7 threads do the
> chacha20 rotate and put the result to the main thread?

Ah, I had forgotten to mention the benchmark. The workload is not
about chacha20 itself. This benchmark utilizes a RTL-level simulator
[1] that runs an Open Source OoO CPU core called XiangShan [2]. The
chacha20 algorithm is executed on the guest CPU within this simulator.

The verilator partitions a large RTL design into multiple blocks
of functions and distributes them to each thread. These signals
require synchronization every guest cycle, and synchronization is
also necessary when a dependency exists. Given that we have
approximately 5K guest cycles per second, there is a significant
amount of data that needs to be transferred between each thread.
If there are signal dependencies, this could lead to latency-bound
performance.

[1] https://github.com/verilator/verilator
[2] https://github.com/OpenXiangShan/XiangShan

Thanks,
Yangyu Chen

> Anyway I tested it on a Xeon EMR with turbo-disabled and saw ~20%
> reduction in the total time.

Nice result!

> 
> Thanks,
> Chenyu



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (20 preceding siblings ...)
  2025-06-19  6:39 ` [RFC patch v3 00/20] Cache aware scheduling Yangyu Chen
@ 2025-06-20 19:25 ` Madadi Vineeth Reddy
  2025-06-22  0:39   ` Chen, Yu C
  2025-06-23 16:45   ` Tim Chen
  2025-06-24  5:00 ` K Prateek Nayak
  2025-07-09 19:39 ` Madadi Vineeth Reddy
  23 siblings, 2 replies; 68+ messages in thread
From: Madadi Vineeth Reddy @ 2025-06-20 19:25 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, Len Brown, linux-kernel,
	Chen Yu, Madadi Vineeth Reddy

Hi Tim,

On 18/06/25 23:57, Tim Chen wrote:
> This is the third revision of the cache aware scheduling patches,
> based on the original patch proposed by Peter[1].
>  
> The goal of the patch series is to aggregate tasks sharing data
> to the same cache domain, thereby reducing cache bouncing and
> cache misses, and improve data access efficiency. In the current
> implementation, threads within the same process are considered
> as entities that potentially share resources.
>  
> In previous versions, aggregation of tasks were done in the
> wake up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
> 
> 1) Aggregation of tasks during wake up led to load imbalance
>    between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake up tasks aggregation happened at a faster rate and
>    load balancing moved tasks in opposite directions, leading
>    to continuous and excessive task migrations and regressions
>    in benchmarks like schbench.
> 
> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
> 
> 1) Identify tasks that prefer to run on their hottest LLC and
>    move them there.
> 2) Prevent generic load balancing from moving a task out of
>    its hottest LLC.
> 
> By default, LLC task aggregation during wake-up is disabled.
> Conversely, cache-aware load balancing is enabled by default.
> For easier comparison, two scheduler features are introduced:
> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> wake up and cache-aware load balancing, respectively. By default,
> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
> is only done on load balancing.

Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
LLC on this platform spans 4 threads.

schbench:
                        baseline (sd%)        baseline+cacheaware (sd%)      %change
Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%

Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%

Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%

Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%

Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%

Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%

Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%

Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%

Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%

The above data shows mostly regression both in the lesser and
higher load cases.


Hackbench pipe:

Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
2       2.987 (1.19%)               2.414 (17.99%)              24.06%
4       7.702 (12.53%)              7.228 (18.37%)               6.16%
8       14.141 (1.32%)              13.109 (1.46%)               7.29%
15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
30      65.118 (4.49%)              61.352 (4.00%)               5.78%
45      105.086 (9.75%)             97.970 (4.26%)               6.77%
60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
75      199.278 (1.21%)             198.680 (1.37%)              0.30%

A lot of run to run variation is seen in hackbench runs. So hard to tell
on the performance but looks better than schbench.

In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
when compared to platforms like sapphire rapids and Milan. Didn't go
through this series yet. Will go through and try to understand why
schbench is not happy on Power systems.

Meanwhile, Wanted to know your thoughts on how does smaller LLC
size get impacted with this patch?

Thanks,
Madadi Vineeth Reddy


> 
> With above default settings, task migrations occur less frequently
> and no longer happen in the latency-sensitive wake-up path.
> 

[..snip..]

> 
> Chen Yu (3):
>   sched: Several fixes for cache aware scheduling
>   sched: Avoid task migration within its preferred LLC
>   sched: Save the per LLC utilization for better cache aware scheduling
> 
> K Prateek Nayak (1):
>   sched: Avoid calculating the cpumask if the system is overloaded
> 
> Peter Zijlstra (1):
>   sched: Cache aware load-balancing
> 
> Tim Chen (15):
>   sched: Add hysteresis to switch a task's preferred LLC
>   sched: Add helper function to decide whether to allow cache aware
>     scheduling
>   sched: Set up LLC indexing
>   sched: Introduce task preferred LLC field
>   sched: Calculate the number of tasks that have LLC preference on a
>     runqueue
>   sched: Introduce per runqueue task LLC preference counter
>   sched: Calculate the total number of preferred LLC tasks during load
>     balance
>   sched: Tag the sched group as llc_balance if it has tasks prefer other
>     LLC
>   sched: Introduce update_llc_busiest() to deal with groups having
>     preferred LLC tasks
>   sched: Introduce a new migration_type to track the preferred LLC load
>     balance
>   sched: Consider LLC locality for active balance
>   sched: Consider LLC preference when picking tasks from busiest queue
>   sched: Do not migrate task if it is moving out of its preferred LLC
>   sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>   sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>     up
> 
>  include/linux/mm_types.h       |  44 ++
>  include/linux/sched.h          |   8 +
>  include/linux/sched/topology.h |   3 +
>  init/Kconfig                   |   4 +
>  init/init_task.c               |   3 +
>  kernel/fork.c                  |   5 +
>  kernel/sched/core.c            |  25 +-
>  kernel/sched/debug.c           |   4 +
>  kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>  kernel/sched/features.h        |   3 +
>  kernel/sched/sched.h           |  23 +
>  kernel/sched/topology.c        |  29 ++
>  12 files changed, 982 insertions(+), 28 deletions(-)
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-20 19:25 ` Madadi Vineeth Reddy
@ 2025-06-22  0:39   ` Chen, Yu C
  2025-06-24 17:47     ` Madadi Vineeth Reddy
  2025-06-23 16:45   ` Tim Chen
  1 sibling, 1 reply; 68+ messages in thread
From: Chen, Yu C @ 2025-06-22  0:39 UTC (permalink / raw)
  To: Madadi Vineeth Reddy, Tim Chen, Peter Zijlstra, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, Len Brown, linux-kernel

On 6/21/2025 3:25 AM, Madadi Vineeth Reddy wrote:
> Hi Tim,
> 
> On 18/06/25 23:57, Tim Chen wrote:
>> This is the third revision of the cache aware scheduling patches,
>> based on the original patch proposed by Peter[1].
>>   
>> The goal of the patch series is to aggregate tasks sharing data
>> to the same cache domain, thereby reducing cache bouncing and
>> cache misses, and improve data access efficiency. In the current
>> implementation, threads within the same process are considered
>> as entities that potentially share resources.
>>   
>> In previous versions, aggregation of tasks were done in the
>> wake up path, without making load balancing paths aware of
>> LLC (Last-Level-Cache) preference. This led to the following
>> problems:
>>
>> 1) Aggregation of tasks during wake up led to load imbalance
>>     between LLCs
>> 2) Load balancing tried to even out the load between LLCs
>> 3) Wake up tasks aggregation happened at a faster rate and
>>     load balancing moved tasks in opposite directions, leading
>>     to continuous and excessive task migrations and regressions
>>     in benchmarks like schbench.
>>
>> In this version, load balancing is made cache-aware. The main
>> idea of cache-aware load balancing consists of two parts:
>>
>> 1) Identify tasks that prefer to run on their hottest LLC and
>>     move them there.
>> 2) Prevent generic load balancing from moving a task out of
>>     its hottest LLC.
>>
>> By default, LLC task aggregation during wake-up is disabled.
>> Conversely, cache-aware load balancing is enabled by default.
>> For easier comparison, two scheduler features are introduced:
>> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
>> wake up and cache-aware load balancing, respectively. By default,
>> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
>> is only done on load balancing.
> 
> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
> LLC on this platform spans 4 threads.
> 
> schbench:
>                          baseline (sd%)        baseline+cacheaware (sd%)      %change
> Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
> Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
> Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
> Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%
> 
> Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
> Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
> Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%
> 
> Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
> Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
> Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
> Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%
> 
> Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
> Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
> Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%
> 
> Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
> Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
> Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
> Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%
> 
> Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
> Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
> Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
> Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%
> 
> Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
> Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
> Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
> Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%
> 
> Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
> Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
> Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
> Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%
> 
> Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
> Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
> Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
> Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%
> 
> The above data shows mostly regression both in the lesser and
> higher load cases.
> 
> 
> Hackbench pipe:
> 
> Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
> 2       2.987 (1.19%)               2.414 (17.99%)              24.06%
> 4       7.702 (12.53%)              7.228 (18.37%)               6.16%
> 8       14.141 (1.32%)              13.109 (1.46%)               7.29%
> 15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
> 30      65.118 (4.49%)              61.352 (4.00%)               5.78%
> 45      105.086 (9.75%)             97.970 (4.26%)               6.77%
> 60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
> 75      199.278 (1.21%)             198.680 (1.37%)              0.30%
> 
> A lot of run to run variation is seen in hackbench runs. So hard to tell
> on the performance but looks better than schbench.

May I know if the cpu frequency was set at a fixed level and deep
cpu idle states were disabled(I assume on power system it is called
stop states?)

> 
> In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
> when compared to platforms like sapphire rapids and Milan. Didn't go
> through this series yet. Will go through and try to understand why
> schbench is not happy on Power systems.
> 
> Meanwhile, Wanted to know your thoughts on how does smaller LLC
> size get impacted with this patch?
> 

task aggregation on smaller LLC domain(both in terms of the
number of CPUs and the size of LLC) might bring cache contention
and hurt performance IMO. May I know what is the cache size on
your system:
lscpu | grep "L3 cache"

May I know if you tested it with:
echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features

vs

echo SCHED_CACHE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
echo SCHED_CACHE_LB > /sys/kernel/debug/sched/features

And could you help check if setting /sys/kernel/debug/sched/llc_aggr_cap
from 50 to some smaller values(25, etc) would help?

thanks,
Chenyu

> Thanks,
> Madadi Vineeth Reddy
> 
> 
>>
>> With above default settings, task migrations occur less frequently
>> and no longer happen in the latency-sensitive wake-up path.
>>
> 
> [..snip..]
> 
>>
>> Chen Yu (3):
>>    sched: Several fixes for cache aware scheduling
>>    sched: Avoid task migration within its preferred LLC
>>    sched: Save the per LLC utilization for better cache aware scheduling
>>
>> K Prateek Nayak (1):
>>    sched: Avoid calculating the cpumask if the system is overloaded
>>
>> Peter Zijlstra (1):
>>    sched: Cache aware load-balancing
>>
>> Tim Chen (15):
>>    sched: Add hysteresis to switch a task's preferred LLC
>>    sched: Add helper function to decide whether to allow cache aware
>>      scheduling
>>    sched: Set up LLC indexing
>>    sched: Introduce task preferred LLC field
>>    sched: Calculate the number of tasks that have LLC preference on a
>>      runqueue
>>    sched: Introduce per runqueue task LLC preference counter
>>    sched: Calculate the total number of preferred LLC tasks during load
>>      balance
>>    sched: Tag the sched group as llc_balance if it has tasks prefer other
>>      LLC
>>    sched: Introduce update_llc_busiest() to deal with groups having
>>      preferred LLC tasks
>>    sched: Introduce a new migration_type to track the preferred LLC load
>>      balance
>>    sched: Consider LLC locality for active balance
>>    sched: Consider LLC preference when picking tasks from busiest queue
>>    sched: Do not migrate task if it is moving out of its preferred LLC
>>    sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>>    sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>>      up
>>
>>   include/linux/mm_types.h       |  44 ++
>>   include/linux/sched.h          |   8 +
>>   include/linux/sched/topology.h |   3 +
>>   init/Kconfig                   |   4 +
>>   init/init_task.c               |   3 +
>>   kernel/fork.c                  |   5 +
>>   kernel/sched/core.c            |  25 +-
>>   kernel/sched/debug.c           |   4 +
>>   kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>>   kernel/sched/features.h        |   3 +
>>   kernel/sched/sched.h           |  23 +
>>   kernel/sched/topology.c        |  29 ++
>>   12 files changed, 982 insertions(+), 28 deletions(-)
>>
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-20 19:25 ` Madadi Vineeth Reddy
  2025-06-22  0:39   ` Chen, Yu C
@ 2025-06-23 16:45   ` Tim Chen
  1 sibling, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-06-23 16:45 UTC (permalink / raw)
  To: Madadi Vineeth Reddy, Peter Zijlstra, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, Len Brown, linux-kernel,
	Chen Yu

On Sat, 2025-06-21 at 00:55 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
> 
> On 18/06/25 23:57, Tim Chen wrote:
> > This is the third revision of the cache aware scheduling patches,
> > based on the original patch proposed by Peter[1].
> >  
> > The goal of the patch series is to aggregate tasks sharing data
> > to the same cache domain, thereby reducing cache bouncing and
> > cache misses, and improve data access efficiency. In the current
> > implementation, threads within the same process are considered
> > as entities that potentially share resources.
> >  
> > In previous versions, aggregation of tasks were done in the
> > wake up path, without making load balancing paths aware of
> > LLC (Last-Level-Cache) preference. This led to the following
> > problems:
> > 
> > 1) Aggregation of tasks during wake up led to load imbalance
> >    between LLCs
> > 2) Load balancing tried to even out the load between LLCs
> > 3) Wake up tasks aggregation happened at a faster rate and
> >    load balancing moved tasks in opposite directions, leading
> >    to continuous and excessive task migrations and regressions
> >    in benchmarks like schbench.
> > 
> > In this version, load balancing is made cache-aware. The main
> > idea of cache-aware load balancing consists of two parts:
> > 
> > 1) Identify tasks that prefer to run on their hottest LLC and
> >    move them there.
> > 2) Prevent generic load balancing from moving a task out of
> >    its hottest LLC.
> > 
> > By default, LLC task aggregation during wake-up is disabled.
> > Conversely, cache-aware load balancing is enabled by default.
> > For easier comparison, two scheduler features are introduced:
> > SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> > wake up and cache-aware load balancing, respectively. By default,
> > NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
> > is only done on load balancing.
> 
> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
> LLC on this platform spans 4 threads.

Hi Madadi,

Thank you for testing this patch series.

If I understand correctly, the Power 11 you tested has 8 threads per core.
My suspicion is we benefit much more from utilizing more cores
than aggregating the load on less cores but sharing the cache
more in this case.


> 
> schbench:
>                         baseline (sd%)        baseline+cacheaware (sd%)      %change
> Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
> Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
> Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
> Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%
> 
> Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
> Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
> Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%
> 
> Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
> Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
> Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
> Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%
> 
> Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
> Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
> Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
> Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%
> 
> Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
> Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
> Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
> Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%
> 
> Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
> Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
> Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
> Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%
> 
> Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
> Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
> Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
> Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%
> 
> Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
> Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
> Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
> Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%
> 
> Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
> Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
> Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
> Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%
> 
> The above data shows mostly regression both in the lesser and
> higher load cases.
> 
> 
> Hackbench pipe:
> 
> Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
> 2       2.987 (1.19%)               2.414 (17.99%)              24.06%
> 4       7.702 (12.53%)              7.228 (18.37%)               6.16%
> 8       14.141 (1.32%)              13.109 (1.46%)               7.29%
> 15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
> 30      65.118 (4.49%)              61.352 (4.00%)               5.78%
> 45      105.086 (9.75%)             97.970 (4.26%)               6.77%
> 60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
> 75      199.278 (1.21%)             198.680 (1.37%)              0.30%
> 
> A lot of run to run variation is seen in hackbench runs. So hard to tell
> on the performance but looks better than schbench.
> 
> In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
> when compared to platforms like sapphire rapids and Milan. Didn't go
> through this series yet. Will go through and try to understand why
> schbench is not happy on Power systems.

My guess is having 8 threads per core, LLC aggregation may have
been too aggressive in consolidating tasks on fewer cores and may have left some
cpu cycles unused. Doing experiments by running one thread per core on Power11
may give us some insights if this conjecture is true.

> 
> Meanwhile, Wanted to know your thoughts on how does smaller LLC
> size get impacted with this patch?
> 

This patch series is currently tuned for systems with single threaded core,
and having many cores and large cache per LLC.  

With only 4 cores and 32 threads per LLC as in Power 11, we run out of cores quickly
and have more cache contention between the tasks consolidated.
We may have to set aggregation threshold (sysctl_llc_aggr_cap) less
than 50% utilization (default), so we consolidate less aggressively
and spread the tasks much sooner. 


Tim

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (21 preceding siblings ...)
  2025-06-20 19:25 ` Madadi Vineeth Reddy
@ 2025-06-24  5:00 ` K Prateek Nayak
  2025-06-24 12:16   ` Chen, Yu C
                     ` (2 more replies)
  2025-07-09 19:39 ` Madadi Vineeth Reddy
  23 siblings, 3 replies; 68+ messages in thread
From: K Prateek Nayak @ 2025-06-24  5:00 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu

Hello Tim,

On 6/18/2025 11:57 PM, Tim Chen wrote:
> AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
> Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
> with 1 group test scenario benefits from cache aware load balance
> too:
> 
> hackbench(1 group and fd ranges in [1,6]:
> case                    load            baseline(std%)  compare%( std%)
> threads-pipe-1          1-groups         1.00 (  1.22)   +2.84 (  0.51)
> threads-pipe-2          1-groups         1.00 (  5.82)  +42.82 ( 43.61)
> threads-pipe-3          1-groups         1.00 (  3.49)  +17.33 ( 18.68)
> threads-pipe-4          1-groups         1.00 (  2.49)  +12.49 (  5.89)
> threads-pipe-5          1-groups         1.00 (  1.46)   +8.62 (  4.43)
> threads-pipe-6          1-groups         1.00 (  2.83)  +12.73 (  8.94)
> threads-sockets-1       1-groups         1.00 (  1.31)  +28.68 (  2.25)
> threads-sockets-2       1-groups         1.00 (  5.17)  +34.84 ( 36.90)
> threads-sockets-3       1-groups         1.00 (  1.57)   +9.15 (  5.52)
> threads-sockets-4       1-groups         1.00 (  1.99)  +16.51 (  6.04)
> threads-sockets-5       1-groups         1.00 (  2.39)  +10.88 (  2.17)
> threads-sockets-6       1-groups         1.00 (  1.62)   +7.22 (  2.00)
> 
> Besides a single instance of hackbench, four instances of hackbench are
> also tested on Milan. The test results show that different instances of
> hackbench are aggregated to dedicated LLCs, and performance improvement
> is observed.
> 
> schbench mmtests(unstable)
>                                    baseline              nowake_lb
> Lat 50.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
> Lat 90.0th-qrtle-1        12.00 (   0.00%)       10.00 (  16.67%)
> Lat 99.0th-qrtle-1        16.00 (   0.00%)       14.00 (  12.50%)
> Lat 99.9th-qrtle-1        22.00 (   0.00%)       21.00 (   4.55%)
> Lat 20.0th-qrtle-1       759.00 (   0.00%)      759.00 (   0.00%)
> Lat 50.0th-qrtle-2         9.00 (   0.00%)        7.00 (  22.22%)
> Lat 90.0th-qrtle-2        12.00 (   0.00%)       12.00 (   0.00%)
> Lat 99.0th-qrtle-2        16.00 (   0.00%)       15.00 (   6.25%)
> Lat 99.9th-qrtle-2        22.00 (   0.00%)       21.00 (   4.55%)
> Lat 20.0th-qrtle-2      1534.00 (   0.00%)     1510.00 (   1.56%)
> Lat 50.0th-qrtle-4         8.00 (   0.00%)        9.00 ( -12.50%)
> Lat 90.0th-qrtle-4        12.00 (   0.00%)       12.00 (   0.00%)
> Lat 99.0th-qrtle-4        15.00 (   0.00%)       16.00 (  -6.67%)
> Lat 99.9th-qrtle-4        21.00 (   0.00%)       23.00 (  -9.52%)
> Lat 20.0th-qrtle-4      3076.00 (   0.00%)     2860.00 (   7.02%)
> Lat 50.0th-qrtle-8        10.00 (   0.00%)        9.00 (  10.00%)
> Lat 90.0th-qrtle-8        12.00 (   0.00%)       13.00 (  -8.33%)
> Lat 99.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
> Lat 99.9th-qrtle-8        22.00 (   0.00%)       24.00 (  -9.09%)
> Lat 20.0th-qrtle-8      6232.00 (   0.00%)     5896.00 (   5.39%)
> Lat 50.0th-qrtle-16        9.00 (   0.00%)        9.00 (   0.00%)
> Lat 90.0th-qrtle-16       13.00 (   0.00%)       13.00 (   0.00%)
> Lat 99.0th-qrtle-16       17.00 (   0.00%)       18.00 (  -5.88%)
> Lat 99.9th-qrtle-16       23.00 (   0.00%)       26.00 ( -13.04%)
> Lat 20.0th-qrtle-16    10096.00 (   0.00%)    10352.00 (  -2.54%)
> Lat 50.0th-qrtle-32       15.00 (   0.00%)       15.00 (   0.00%)
> Lat 90.0th-qrtle-32       25.00 (   0.00%)       26.00 (  -4.00%)
> Lat 99.0th-qrtle-32       49.00 (   0.00%)       50.00 (  -2.04%)
> Lat 99.9th-qrtle-32      945.00 (   0.00%)     1005.00 (  -6.35%)
> Lat 20.0th-qrtle-32    11600.00 (   0.00%)    11632.00 (  -0.28%)
> 
> Netperf/Tbench have not been tested yet. As they are single-process
> benchmarks that are not the target of this cache-aware scheduling.
> Additionally, client and server components should be tested on
> different machines or bound to different nodes. Otherwise,
> cache-aware scheduling might harm their performance: placing client
> and server in the same LLC could yield higher throughput due to
> improved cache locality in the TCP/IP stack, whereas cache-aware
> scheduling aims to place them in dedicated LLCs.

I have similar observation from my testing.

tl;dr

o Benchmark that prefer co-location and run in threaded mode see
   a benefit including hackbench at high utilization and schbench
   at low utilization.

o schbench (both new and old but particularly the old) regresses
   quite a bit on the tial latency metric when #workers cross the
   LLC size.

o client-server benchmarks where client and servers are threads
   from different processes (netserver-netperf, tbench_srv-tbench,
   services of DeathStarBench) seem to noticeably regress due to
   lack of co-location between the communicating client and server.

   Not sure if WF_SYNC can be an indicator to temporarily ignore
   the preferred LLC hint.

o stream regresses in some runs where the occupancy metrics trip
   and assign a preferred LLC for all the stream threads bringing
   down performance in !50% of the runs.

Full data from my testing is as follows:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Kernel details

tip:	  tip:sched/core at commit 914873bc7df9 ("Merge tag
            'x86-build-2025-05-25' of
            git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")

llc-aware-lb-v3: tip + this series as is

o Benchmark results

     ==================================================================
     Test          : hackbench
     Units         : Normalized time in seconds
     Interpretation: Lower is better
     Statistic     : AMean
     ==================================================================
     Case:           tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      1-groups     1.00 [ -0.00](13.74)     1.03 [ -2.77](12.01)
      2-groups     1.00 [ -0.00]( 9.58)     1.02 [ -1.78]( 6.12)
      4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -0.87]( 0.91)
      8-groups     1.00 [ -0.00]( 1.51)     1.03 [ -3.31]( 2.06)
     16-groups     1.00 [ -0.00]( 1.10)     0.95 [  5.36]( 1.67)


     ==================================================================
     Test          : tbench
     Units         : Normalized throughput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:    tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
         1     1.00 [  0.00]( 0.82)     0.96 [ -3.68]( 1.23)
         2     1.00 [  0.00]( 1.13)     0.98 [ -2.30]( 0.51)
         4     1.00 [  0.00]( 1.12)     0.96 [ -4.14]( 0.22)
         8     1.00 [  0.00]( 0.93)     0.96 [ -3.61]( 0.46)
        16     1.00 [  0.00]( 0.38)     0.95 [ -4.98]( 1.26)
        32     1.00 [  0.00]( 0.66)     0.93 [ -7.12]( 2.22)
        64     1.00 [  0.00]( 1.18)     0.95 [ -5.44]( 0.37)
       128     1.00 [  0.00]( 1.12)     0.93 [ -6.78]( 0.64)
       256     1.00 [  0.00]( 0.42)     0.94 [ -6.45]( 0.47)
       512     1.00 [  0.00]( 0.14)     0.93 [ -7.26]( 0.27)
      1024     1.00 [  0.00]( 0.26)     0.92 [ -7.57]( 0.31)


     ==================================================================
     Test          : stream-10
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      Copy     1.00 [  0.00]( 8.37)     0.39 [-61.05](44.88)
     Scale     1.00 [  0.00]( 2.85)     0.43 [-57.26](40.60)
       Add     1.00 [  0.00]( 3.39)     0.40 [-59.88](42.02)
     Triad     1.00 [  0.00]( 6.39)     0.41 [-58.93](42.98)


     ==================================================================
     Test          : stream-100
     Units         : Normalized Bandwidth, MB/s
     Interpretation: Higher is better
     Statistic     : HMean
     ==================================================================
     Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      Copy     1.00 [  0.00]( 3.91)     0.36 [-63.95](51.04)
     Scale     1.00 [  0.00]( 4.34)     0.40 [-60.31](43.12)
       Add     1.00 [  0.00]( 4.14)     0.38 [-62.46](43.40)
     Triad     1.00 [  0.00]( 1.00)     0.36 [-64.38](43.12)


     ==================================================================
     Test          : netperf
     Units         : Normalized Througput
     Interpretation: Higher is better
     Statistic     : AMean
     ==================================================================
     Clients:         tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
      1-clients     1.00 [  0.00]( 0.41)     0.97 [ -3.26]( 1.30)
      2-clients     1.00 [  0.00]( 0.58)     0.96 [ -4.24]( 0.71)
      4-clients     1.00 [  0.00]( 0.35)     0.96 [ -4.19]( 0.67)
      8-clients     1.00 [  0.00]( 0.48)     0.95 [ -5.41]( 1.36)
     16-clients     1.00 [  0.00]( 0.66)     0.95 [ -5.31]( 0.93)
     32-clients     1.00 [  0.00]( 1.15)     0.94 [ -6.43]( 1.44)
     64-clients     1.00 [  0.00]( 1.38)     0.93 [ -7.14]( 1.63)
     128-clients    1.00 [  0.00]( 0.87)     0.89 [-10.62]( 0.78)
     256-clients    1.00 [  0.00]( 5.36)     0.92 [ -8.04]( 2.64)
     512-clients    1.00 [  0.00](54.39)     0.88 [-12.12](48.87)


     ==================================================================
     Test          : schbench
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
       1     1.00 [ -0.00]( 8.54)     0.54 [ 45.65](28.79)
       2     1.00 [ -0.00]( 1.15)     0.56 [ 44.00]( 2.09)
       4     1.00 [ -0.00](13.46)     0.67 [ 33.33](35.68)
       8     1.00 [ -0.00]( 7.14)     0.63 [ 36.84]( 4.28)
      16     1.00 [ -0.00]( 3.49)     1.05 [ -5.08]( 9.13)
      32     1.00 [ -0.00]( 1.06)    32.04 [-3104.26](81.31)
      64     1.00 [ -0.00]( 5.48)    24.51 [-2351.16](81.18)
     128     1.00 [ -0.00](10.45)    14.56 [-1356.07]( 5.35)
     256     1.00 [ -0.00](31.14)     0.95 [  4.80](20.88)
     512     1.00 [ -0.00]( 1.52)     1.00 [ -0.25]( 1.26)


     ==================================================================
     Test          : new-schbench-requests-per-second
     Units         : Normalized Requests per second
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
       1     1.00 [  0.00]( 1.07)     0.97 [ -3.24]( 0.98)
       2     1.00 [  0.00]( 0.00)     0.99 [ -1.17]( 0.15)
       4     1.00 [  0.00]( 0.00)     0.96 [ -3.50]( 0.56)
       8     1.00 [  0.00]( 0.15)     0.98 [ -1.76]( 0.31)
      16     1.00 [  0.00]( 0.00)     0.94 [ -6.13]( 1.93)
      32     1.00 [  0.00]( 3.41)     0.97 [ -3.18]( 2.10)
      64     1.00 [  0.00]( 1.05)     0.82 [-18.14](18.41)
     128     1.00 [  0.00]( 0.00)     0.98 [ -2.27]( 0.20)
     256     1.00 [  0.00]( 0.72)     1.01 [  1.23]( 0.31)
     512     1.00 [  0.00]( 0.57)     1.00 [  0.00]( 0.12)


     ==================================================================
     Test          : new-schbench-wakeup-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
       1     1.00 [ -0.00]( 9.11)     0.88 [ 12.50](11.92)
       2     1.00 [ -0.00]( 0.00)     0.86 [ 14.29](11.92)
       4     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 4.08)
       8     1.00 [ -0.00]( 0.00)     0.83 [ 16.67]( 5.34)
      16     1.00 [ -0.00]( 7.56)     0.85 [ 15.38]( 0.00)
      32     1.00 [ -0.00](15.11)     0.80 [ 20.00]( 4.19)
      64     1.00 [ -0.00]( 9.63)     1.05 [ -5.00](24.47)
     128     1.00 [ -0.00]( 4.86)     1.57 [-56.78](68.52)
     256     1.00 [ -0.00]( 2.34)     1.00 [ -0.00]( 0.57)
     512     1.00 [ -0.00]( 0.40)     1.00 [ -0.00]( 0.34)


     ==================================================================
     Test          : new-schbench-request-latency
     Units         : Normalized 99th percentile latency in us
     Interpretation: Lower is better
     Statistic     : Median
     ==================================================================
     #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
       1     1.00 [ -0.00]( 2.73)     1.06 [ -5.71]( 0.25)
       2     1.00 [ -0.00]( 0.87)     1.08 [ -8.37]( 0.78)
       4     1.00 [ -0.00]( 1.21)     1.09 [ -9.15]( 0.79)
       8     1.00 [ -0.00]( 0.27)     1.06 [ -6.31]( 0.51)
      16     1.00 [ -0.00]( 4.04)     1.85 [-84.55]( 5.11)
      32     1.00 [ -0.00]( 7.35)     1.52 [-52.16]( 0.83)
      64     1.00 [ -0.00]( 3.54)     1.06 [ -5.77]( 2.62)
     128     1.00 [ -0.00]( 0.37)     1.09 [ -9.18](28.47)
     256     1.00 [ -0.00]( 9.57)     0.99 [  0.60]( 0.48)
     512     1.00 [ -0.00]( 1.82)     1.03 [ -2.80]( 1.16)


     ==================================================================
     Test          : Various longer running benchmarks
     Units         : %diff in throughput reported
     Interpretation: Higher is better
     Statistic     : Median
     ==================================================================
     Benchmarks:                  %diff
     ycsb-cassandra              -0.99%
     ycsb-mongodb                -0.96%
     deathstarbench-1x           -2.09%
     deathstarbench-2x           -0.26%
     deathstarbench-3x           -3.34%
     deathstarbench-6x           -3.03%
     hammerdb+mysql 16VU         -2.15%
     hammerdb+mysql 64VU         -3.77%

> 
> This patch set is applied on v6.15 kernel.
>   
> There are some further work needed for future versions in this
> patch set.  We will need to align NUMA balancing with LLC aggregations
> such that LLC aggregation will align with the preferred NUMA node.
> 
> Comments and tests are much appreciated.

I'll rerun the test once with the SCHED_FEAT() disabled just to make
sure I'm not regressing because of some other factors. For the major
regressions, I'll get the "perf sched stats" data to see if anything
stands out.

I'm also planning on getting the data from a Zen5c system with larger
LLC to see if there is any difference in the trend (I'll start with the
microbenchmarks since setting the larger ones will take some time)

Sorry for the lack of engagement on previous versions but I plan on
taking a better look at the series this time around. If you need any
specific data from my setup, please do let me know.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-24  5:00 ` K Prateek Nayak
@ 2025-06-24 12:16   ` Chen, Yu C
  2025-06-25  4:19     ` K Prateek Nayak
  2025-06-25  0:30   ` Tim Chen
  2025-07-03 20:00   ` Shrikanth Hegde
  2 siblings, 1 reply; 68+ messages in thread
From: Chen, Yu C @ 2025-06-24 12:16 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Tim Chen, Peter Zijlstra, Gautham R . Shenoy,
	Ingo Molnar


On 6/24/2025 1:00 PM, K Prateek Nayak wrote:
> Hello Tim,
> 
> On 6/18/2025 11:57 PM, Tim Chen wrote:
>> AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
>> Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
>> with 1 group test scenario benefits from cache aware load balance
>> too:
>>
>> hackbench(1 group and fd ranges in [1,6]:
>> case                    load            baseline(std%)  compare%( std%)
>> threads-pipe-1          1-groups         1.00 (  1.22)   +2.84 (  0.51)
>> threads-pipe-2          1-groups         1.00 (  5.82)  +42.82 ( 43.61)
>> threads-pipe-3          1-groups         1.00 (  3.49)  +17.33 ( 18.68)
>> threads-pipe-4          1-groups         1.00 (  2.49)  +12.49 (  5.89)
>> threads-pipe-5          1-groups         1.00 (  1.46)   +8.62 (  4.43)
>> threads-pipe-6          1-groups         1.00 (  2.83)  +12.73 (  8.94)
>> threads-sockets-1       1-groups         1.00 (  1.31)  +28.68 (  2.25)
>> threads-sockets-2       1-groups         1.00 (  5.17)  +34.84 ( 36.90)
>> threads-sockets-3       1-groups         1.00 (  1.57)   +9.15 (  5.52)
>> threads-sockets-4       1-groups         1.00 (  1.99)  +16.51 (  6.04)
>> threads-sockets-5       1-groups         1.00 (  2.39)  +10.88 (  2.17)
>> threads-sockets-6       1-groups         1.00 (  1.62)   +7.22 (  2.00)
>>
>> Besides a single instance of hackbench, four instances of hackbench are
>> also tested on Milan. The test results show that different instances of
>> hackbench are aggregated to dedicated LLCs, and performance improvement
>> is observed.
>>
>> schbench mmtests(unstable)
>>                                    baseline              nowake_lb
>> Lat 50.0th-qrtle-1         9.00 (   0.00%)        8.00 (  11.11%)
>> Lat 90.0th-qrtle-1        12.00 (   0.00%)       10.00 (  16.67%)
>> Lat 99.0th-qrtle-1        16.00 (   0.00%)       14.00 (  12.50%)
>> Lat 99.9th-qrtle-1        22.00 (   0.00%)       21.00 (   4.55%)
>> Lat 20.0th-qrtle-1       759.00 (   0.00%)      759.00 (   0.00%)
>> Lat 50.0th-qrtle-2         9.00 (   0.00%)        7.00 (  22.22%)
>> Lat 90.0th-qrtle-2        12.00 (   0.00%)       12.00 (   0.00%)
>> Lat 99.0th-qrtle-2        16.00 (   0.00%)       15.00 (   6.25%)
>> Lat 99.9th-qrtle-2        22.00 (   0.00%)       21.00 (   4.55%)
>> Lat 20.0th-qrtle-2      1534.00 (   0.00%)     1510.00 (   1.56%)
>> Lat 50.0th-qrtle-4         8.00 (   0.00%)        9.00 ( -12.50%)
>> Lat 90.0th-qrtle-4        12.00 (   0.00%)       12.00 (   0.00%)
>> Lat 99.0th-qrtle-4        15.00 (   0.00%)       16.00 (  -6.67%)
>> Lat 99.9th-qrtle-4        21.00 (   0.00%)       23.00 (  -9.52%)
>> Lat 20.0th-qrtle-4      3076.00 (   0.00%)     2860.00 (   7.02%)
>> Lat 50.0th-qrtle-8        10.00 (   0.00%)        9.00 (  10.00%)
>> Lat 90.0th-qrtle-8        12.00 (   0.00%)       13.00 (  -8.33%)
>> Lat 99.0th-qrtle-8        17.00 (   0.00%)       17.00 (   0.00%)
>> Lat 99.9th-qrtle-8        22.00 (   0.00%)       24.00 (  -9.09%)
>> Lat 20.0th-qrtle-8      6232.00 (   0.00%)     5896.00 (   5.39%)
>> Lat 50.0th-qrtle-16        9.00 (   0.00%)        9.00 (   0.00%)
>> Lat 90.0th-qrtle-16       13.00 (   0.00%)       13.00 (   0.00%)
>> Lat 99.0th-qrtle-16       17.00 (   0.00%)       18.00 (  -5.88%)
>> Lat 99.9th-qrtle-16       23.00 (   0.00%)       26.00 ( -13.04%)
>> Lat 20.0th-qrtle-16    10096.00 (   0.00%)    10352.00 (  -2.54%)
>> Lat 50.0th-qrtle-32       15.00 (   0.00%)       15.00 (   0.00%)
>> Lat 90.0th-qrtle-32       25.00 (   0.00%)       26.00 (  -4.00%)
>> Lat 99.0th-qrtle-32       49.00 (   0.00%)       50.00 (  -2.04%)
>> Lat 99.9th-qrtle-32      945.00 (   0.00%)     1005.00 (  -6.35%)
>> Lat 20.0th-qrtle-32    11600.00 (   0.00%)    11632.00 (  -0.28%)
>>
>> Netperf/Tbench have not been tested yet. As they are single-process
>> benchmarks that are not the target of this cache-aware scheduling.
>> Additionally, client and server components should be tested on
>> different machines or bound to different nodes. Otherwise,
>> cache-aware scheduling might harm their performance: placing client
>> and server in the same LLC could yield higher throughput due to
>> improved cache locality in the TCP/IP stack, whereas cache-aware
>> scheduling aims to place them in dedicated LLCs.
> 
> I have similar observation from my testing.
> 

Prateek, thanks for your test.

> tl;dr
> 
> o Benchmark that prefer co-location and run in threaded mode see
>    a benefit including hackbench at high utilization and schbench
>    at low utilization.
> 

Previously, we tested hackbench with one group using different
fd pairs. The number of fds (1–6) was lower than the number
of CPUs (8) within one CCX. If I understand correctly, the
default number of fd pairs in hackbench is 20. We might need
to handle cases where the number of threads (nr_thread)
exceeds the number of CPUs per LLC—perhaps by
skipping task aggregation in such scenarios.

> o schbench (both new and old but particularly the old) regresses
>    quite a bit on the tial latency metric when #workers cross the
>    LLC size.
> 

As mentioned above, maybe re-consider the nr_thread vs nr_cpus_per_llc
could mitigate the issue. Besides, maybe introduce a rate limit
for cache aware aggregation would help.

> o client-server benchmarks where client and servers are threads
>    from different processes (netserver-netperf, tbench_srv-tbench,
>    services of DeathStarBench) seem to noticeably regress due to
>    lack of co-location between the communicating client and server.
> 
>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>    the preferred LLC hint.

WF_SYNC is used in wakeup path, the current v3 version does the
task aggregation in the load balance path. We'll look into this
C/S scenario.

> 
> o stream regresses in some runs where the occupancy metrics trip
>    and assign a preferred LLC for all the stream threads bringing
>    down performance in !50% of the runs.
> 

May I know if you tested the stream with mmtests under OMP mode,
and what do stream-10 and stream-100 mean? Stream is an example
where all threads have their private memory buffers—no
interaction with each other. For this benchmark, spreading
them across different Nodes gets higher memory bandwidth because
stream allocates the buffer to be at least 4X the L3 cache size.
We lack a metric that can indicate when threads share a lot of
data (e.g., both Thread 1 and Thread 2 read from the same
buffer). In such cases, we should aggregate the threads;
otherwise, do not aggregate them (as in the stream case).
On the other hand, stream-omp seems like an unrealistic
scenario—if threads do not share buffer, why create them
in the same process?


> Full data from my testing is as follows:
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> o Kernel details
> 
> tip:      tip:sched/core at commit 914873bc7df9 ("Merge tag
>             'x86-build-2025-05-25' of
>             git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> 
> llc-aware-lb-v3: tip + this series as is
> 
> o Benchmark results
> 
>      ==================================================================
>      Test          : hackbench
>      Units         : Normalized time in seconds
>      Interpretation: Lower is better
>      Statistic     : AMean
>      ==================================================================
>      Case:           tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>       1-groups     1.00 [ -0.00](13.74)     1.03 [ -2.77](12.01)
>       2-groups     1.00 [ -0.00]( 9.58)     1.02 [ -1.78]( 6.12)
>       4-groups     1.00 [ -0.00]( 2.10)     1.01 [ -0.87]( 0.91)
>       8-groups     1.00 [ -0.00]( 1.51)     1.03 [ -3.31]( 2.06)
>      16-groups     1.00 [ -0.00]( 1.10)     0.95 [  5.36]( 1.67)
> 
> 
>      ==================================================================
>      Test          : tbench
>      Units         : Normalized throughput
>      Interpretation: Higher is better
>      Statistic     : AMean
>      ==================================================================
>      Clients:    tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>          1     1.00 [  0.00]( 0.82)     0.96 [ -3.68]( 1.23)
>          2     1.00 [  0.00]( 1.13)     0.98 [ -2.30]( 0.51)
>          4     1.00 [  0.00]( 1.12)     0.96 [ -4.14]( 0.22)
>          8     1.00 [  0.00]( 0.93)     0.96 [ -3.61]( 0.46)
>         16     1.00 [  0.00]( 0.38)     0.95 [ -4.98]( 1.26)
>         32     1.00 [  0.00]( 0.66)     0.93 [ -7.12]( 2.22)
>         64     1.00 [  0.00]( 1.18)     0.95 [ -5.44]( 0.37)
>        128     1.00 [  0.00]( 1.12)     0.93 [ -6.78]( 0.64)
>        256     1.00 [  0.00]( 0.42)     0.94 [ -6.45]( 0.47)
>        512     1.00 [  0.00]( 0.14)     0.93 [ -7.26]( 0.27)
>       1024     1.00 [  0.00]( 0.26)     0.92 [ -7.57]( 0.31)
> 
> 
>      ==================================================================
>      Test          : stream-10
>      Units         : Normalized Bandwidth, MB/s
>      Interpretation: Higher is better
>      Statistic     : HMean
>      ==================================================================
>      Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>       Copy     1.00 [  0.00]( 8.37)     0.39 [-61.05](44.88)
>      Scale     1.00 [  0.00]( 2.85)     0.43 [-57.26](40.60)
>        Add     1.00 [  0.00]( 3.39)     0.40 [-59.88](42.02)
>      Triad     1.00 [  0.00]( 6.39)     0.41 [-58.93](42.98)
> 
> 
>      ==================================================================
>      Test          : stream-100
>      Units         : Normalized Bandwidth, MB/s
>      Interpretation: Higher is better
>      Statistic     : HMean
>      ==================================================================
>      Test:       tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>       Copy     1.00 [  0.00]( 3.91)     0.36 [-63.95](51.04)
>      Scale     1.00 [  0.00]( 4.34)     0.40 [-60.31](43.12)
>        Add     1.00 [  0.00]( 4.14)     0.38 [-62.46](43.40)
>      Triad     1.00 [  0.00]( 1.00)     0.36 [-64.38](43.12)
> 
> 
>      ==================================================================
>      Test          : netperf
>      Units         : Normalized Througput
>      Interpretation: Higher is better
>      Statistic     : AMean
>      ==================================================================
>      Clients:         tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>       1-clients     1.00 [  0.00]( 0.41)     0.97 [ -3.26]( 1.30)
>       2-clients     1.00 [  0.00]( 0.58)     0.96 [ -4.24]( 0.71)
>       4-clients     1.00 [  0.00]( 0.35)     0.96 [ -4.19]( 0.67)
>       8-clients     1.00 [  0.00]( 0.48)     0.95 [ -5.41]( 1.36)
>      16-clients     1.00 [  0.00]( 0.66)     0.95 [ -5.31]( 0.93)
>      32-clients     1.00 [  0.00]( 1.15)     0.94 [ -6.43]( 1.44)
>      64-clients     1.00 [  0.00]( 1.38)     0.93 [ -7.14]( 1.63)
>      128-clients    1.00 [  0.00]( 0.87)     0.89 [-10.62]( 0.78)
>      256-clients    1.00 [  0.00]( 5.36)     0.92 [ -8.04]( 2.64)
>      512-clients    1.00 [  0.00](54.39)     0.88 [-12.12](48.87)
> 
> 
>      ==================================================================
>      Test          : schbench
>      Units         : Normalized 99th percentile latency in us
>      Interpretation: Lower is better
>      Statistic     : Median
>      ==================================================================
>      #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>        1     1.00 [ -0.00]( 8.54)     0.54 [ 45.65](28.79)
>        2     1.00 [ -0.00]( 1.15)     0.56 [ 44.00]( 2.09)
>        4     1.00 [ -0.00](13.46)     0.67 [ 33.33](35.68)
>        8     1.00 [ -0.00]( 7.14)     0.63 [ 36.84]( 4.28)
>       16     1.00 [ -0.00]( 3.49)     1.05 [ -5.08]( 9.13)
>       32     1.00 [ -0.00]( 1.06)    32.04 [-3104.26](81.31)
>       64     1.00 [ -0.00]( 5.48)    24.51 [-2351.16](81.18)
>      128     1.00 [ -0.00](10.45)    14.56 [-1356.07]( 5.35)
>      256     1.00 [ -0.00](31.14)     0.95 [  4.80](20.88)
>      512     1.00 [ -0.00]( 1.52)     1.00 [ -0.25]( 1.26)
> 
> 
>      ==================================================================
>      Test          : new-schbench-requests-per-second
>      Units         : Normalized Requests per second
>      Interpretation: Higher is better
>      Statistic     : Median
>      ==================================================================
>      #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>        1     1.00 [  0.00]( 1.07)     0.97 [ -3.24]( 0.98)
>        2     1.00 [  0.00]( 0.00)     0.99 [ -1.17]( 0.15)
>        4     1.00 [  0.00]( 0.00)     0.96 [ -3.50]( 0.56)
>        8     1.00 [  0.00]( 0.15)     0.98 [ -1.76]( 0.31)
>       16     1.00 [  0.00]( 0.00)     0.94 [ -6.13]( 1.93)
>       32     1.00 [  0.00]( 3.41)     0.97 [ -3.18]( 2.10)
>       64     1.00 [  0.00]( 1.05)     0.82 [-18.14](18.41)
>      128     1.00 [  0.00]( 0.00)     0.98 [ -2.27]( 0.20)
>      256     1.00 [  0.00]( 0.72)     1.01 [  1.23]( 0.31)
>      512     1.00 [  0.00]( 0.57)     1.00 [  0.00]( 0.12)
> 
> 
>      ==================================================================
>      Test          : new-schbench-wakeup-latency
>      Units         : Normalized 99th percentile latency in us
>      Interpretation: Lower is better
>      Statistic     : Median
>      ==================================================================
>      #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>        1     1.00 [ -0.00]( 9.11)     0.88 [ 12.50](11.92)
>        2     1.00 [ -0.00]( 0.00)     0.86 [ 14.29](11.92)
>        4     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 4.08)
>        8     1.00 [ -0.00]( 0.00)     0.83 [ 16.67]( 5.34)
>       16     1.00 [ -0.00]( 7.56)     0.85 [ 15.38]( 0.00)
>       32     1.00 [ -0.00](15.11)     0.80 [ 20.00]( 4.19)
>       64     1.00 [ -0.00]( 9.63)     1.05 [ -5.00](24.47)
>      128     1.00 [ -0.00]( 4.86)     1.57 [-56.78](68.52)
>      256     1.00 [ -0.00]( 2.34)     1.00 [ -0.00]( 0.57)
>      512     1.00 [ -0.00]( 0.40)     1.00 [ -0.00]( 0.34)
> 
> 
>      ==================================================================
>      Test          : new-schbench-request-latency
>      Units         : Normalized 99th percentile latency in us
>      Interpretation: Lower is better
>      Statistic     : Median
>      ==================================================================
>      #workers: tip[pct imp](CV)    llc-aware-lb-v3[pct imp](CV)
>        1     1.00 [ -0.00]( 2.73)     1.06 [ -5.71]( 0.25)
>        2     1.00 [ -0.00]( 0.87)     1.08 [ -8.37]( 0.78)
>        4     1.00 [ -0.00]( 1.21)     1.09 [ -9.15]( 0.79)
>        8     1.00 [ -0.00]( 0.27)     1.06 [ -6.31]( 0.51)
>       16     1.00 [ -0.00]( 4.04)     1.85 [-84.55]( 5.11)
>       32     1.00 [ -0.00]( 7.35)     1.52 [-52.16]( 0.83)
>       64     1.00 [ -0.00]( 3.54)     1.06 [ -5.77]( 2.62)
>      128     1.00 [ -0.00]( 0.37)     1.09 [ -9.18](28.47)
>      256     1.00 [ -0.00]( 9.57)     0.99 [  0.60]( 0.48)
>      512     1.00 [ -0.00]( 1.82)     1.03 [ -2.80]( 1.16)
> 
> 
>      ==================================================================
>      Test          : Various longer running benchmarks
>      Units         : %diff in throughput reported
>      Interpretation: Higher is better
>      Statistic     : Median
>      ==================================================================
>      Benchmarks:                  %diff
>      ycsb-cassandra              -0.99%
>      ycsb-mongodb                -0.96%
>      deathstarbench-1x           -2.09%
>      deathstarbench-2x           -0.26%
>      deathstarbench-3x           -3.34%
>      deathstarbench-6x           -3.03%
>      hammerdb+mysql 16VU         -2.15%
>      hammerdb+mysql 64VU         -3.77%
> 
>>
>> This patch set is applied on v6.15 kernel.
>> There are some further work needed for future versions in this
>> patch set.  We will need to align NUMA balancing with LLC aggregations
>> such that LLC aggregation will align with the preferred NUMA node.
>>
>> Comments and tests are much appreciated.
> 
> I'll rerun the test once with the SCHED_FEAT() disabled just to make
> sure I'm not regressing because of some other factors. For the major
> regressions, I'll get the "perf sched stats" data to see if anything
> stands out.

It seems that task migration and task bouncing between its preferred
LLC and non-preferred LLC is one symptom that caused regression.

thanks,
Chenyu

> 
> I'm also planning on getting the data from a Zen5c system with larger
> LLC to see if there is any difference in the trend (I'll start with the
> microbenchmarks since setting the larger ones will take some time)
> 
> Sorry for the lack of engagement on previous versions but I plan on
> taking a better look at the series this time around. If you need any
> specific data from my setup, please do let me know.
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-22  0:39   ` Chen, Yu C
@ 2025-06-24 17:47     ` Madadi Vineeth Reddy
  0 siblings, 0 replies; 68+ messages in thread
From: Madadi Vineeth Reddy @ 2025-06-24 17:47 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, Len Brown, linux-kernel,
	Madadi Vineeth Reddy, Tim Chen, Peter Zijlstra, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy

Hi Chen,

On 22/06/25 06:09, Chen, Yu C wrote:
> On 6/21/2025 3:25 AM, Madadi Vineeth Reddy wrote:
>> Hi Tim,
>>
>> On 18/06/25 23:57, Tim Chen wrote:
>>> This is the third revision of the cache aware scheduling patches,
>>> based on the original patch proposed by Peter[1].
>>>   The goal of the patch series is to aggregate tasks sharing data
>>> to the same cache domain, thereby reducing cache bouncing and
>>> cache misses, and improve data access efficiency. In the current
>>> implementation, threads within the same process are considered
>>> as entities that potentially share resources.
>>>   In previous versions, aggregation of tasks were done in the
>>> wake up path, without making load balancing paths aware of
>>> LLC (Last-Level-Cache) preference. This led to the following
>>> problems:
>>>
>>> 1) Aggregation of tasks during wake up led to load imbalance
>>>     between LLCs
>>> 2) Load balancing tried to even out the load between LLCs
>>> 3) Wake up tasks aggregation happened at a faster rate and
>>>     load balancing moved tasks in opposite directions, leading
>>>     to continuous and excessive task migrations and regressions
>>>     in benchmarks like schbench.
>>>
>>> In this version, load balancing is made cache-aware. The main
>>> idea of cache-aware load balancing consists of two parts:
>>>
>>> 1) Identify tasks that prefer to run on their hottest LLC and
>>>     move them there.
>>> 2) Prevent generic load balancing from moving a task out of
>>>     its hottest LLC.
>>>
>>> By default, LLC task aggregation during wake-up is disabled.
>>> Conversely, cache-aware load balancing is enabled by default.
>>> For easier comparison, two scheduler features are introduced:
>>> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
>>> wake up and cache-aware load balancing, respectively. By default,
>>> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so tasks aggregation
>>> is only done on load balancing.
>>
>> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
>> LLC on this platform spans 4 threads.
>>
>> schbench:
>>                          baseline (sd%)        baseline+cacheaware (sd%)      %change
>> Lat 50.0th-worker-1        6.33 (24.12%)           6.00 (28.87%)               5.21%
>> Lat 90.0th-worker-1        7.67 ( 7.53%)           7.67 (32.83%)               0.00%
>> Lat 99.0th-worker-1        8.67 ( 6.66%)           9.33 (37.63%)              -7.61%
>> Lat 99.9th-worker-1       21.33 (63.99%)          12.33 (28.47%)              42.19%
>>
>> Lat 50.0th-worker-2        4.33 (13.32%)           5.67 (10.19%)             -30.95%
>> Lat 90.0th-worker-2        5.67 (20.38%)           7.67 ( 7.53%)             -35.27%
>> Lat 99.0th-worker-2        7.33 ( 7.87%)           8.33 ( 6.93%)             -13.64%
>> Lat 99.9th-worker-2       11.67 (24.74%)          10.33 (11.17%)              11.48%
>>
>> Lat 50.0th-worker-4        5.00 ( 0.00%)           7.00 ( 0.00%)             -40.00%
>> Lat 90.0th-worker-4        7.00 ( 0.00%)           9.67 ( 5.97%)             -38.14%
>> Lat 99.0th-worker-4        8.00 ( 0.00%)          11.33 (13.48%)             -41.62%
>> Lat 99.9th-worker-4       10.33 ( 5.59%)          14.00 ( 7.14%)             -35.53%
>>
>> Lat 50.0th-worker-8        4.33 (13.32%)           5.67 (10.19%)             -30.95%
>> Lat 90.0th-worker-8        6.33 (18.23%)           8.67 ( 6.66%)             -36.99%
>> Lat 99.0th-worker-8        7.67 ( 7.53%)          10.33 ( 5.59%)             -34.69%
>> Lat 99.9th-worker-8       10.00 (10.00%)          12.33 ( 4.68%)             -23.30%
>>
>> Lat 50.0th-worker-16       4.00 ( 0.00%)           5.00 ( 0.00%)             -25.00%
>> Lat 90.0th-worker-16       6.33 ( 9.12%)           7.67 ( 7.53%)             -21.21%
>> Lat 99.0th-worker-16       8.00 ( 0.00%)          10.33 ( 5.59%)             -29.13%
>> Lat 99.9th-worker-16      12.00 ( 8.33%)          13.33 ( 4.33%)             -11.08%
>>
>> Lat 50.0th-worker-32       5.00 ( 0.00%)           5.33 (10.83%)              -6.60%
>> Lat 90.0th-worker-32       7.00 ( 0.00%)           8.67 (17.63%)             -23.86%
>> Lat 99.0th-worker-32      10.67 (14.32%)          12.67 ( 4.56%)             -18.75%
>> Lat 99.9th-worker-32      14.67 ( 3.94%)          19.00 (13.93%)             -29.49%
>>
>> Lat 50.0th-worker-64       5.33 (10.83%)           6.67 ( 8.66%)             -25.14%
>> Lat 90.0th-worker-64      10.00 (17.32%)          14.33 ( 4.03%)             -43.30%
>> Lat 99.0th-worker-64      14.00 ( 7.14%)          16.67 ( 3.46%)             -19.07%
>> Lat 99.9th-worker-64      55.00 (56.69%)          47.00 (61.92%)              14.55%
>>
>> Lat 50.0th-worker-128      8.00 ( 0.00%)           8.67 (13.32%)              -8.38%
>> Lat 90.0th-worker-128     13.33 ( 4.33%)          14.33 ( 8.06%)              -7.50%
>> Lat 99.0th-worker-128     16.00 ( 0.00%)          20.00 ( 8.66%)             -25.00%
>> Lat 99.9th-worker-128   2258.33 (83.80%)        2974.67 (21.82%)             -31.72%
>>
>> Lat 50.0th-worker-256     47.67 ( 2.42%)          45.33 ( 3.37%)               4.91%
>> Lat 90.0th-worker-256   3470.67 ( 1.88%)        3558.67 ( 0.47%)              -2.54%
>> Lat 99.0th-worker-256   9040.00 ( 2.76%)        9050.67 ( 0.41%)              -0.12%
>> Lat 99.9th-worker-256  13824.00 (20.07%)       13104.00 ( 6.84%)               5.21%
>>
>> The above data shows mostly regression both in the lesser and
>> higher load cases.
>>
>>
>> Hackbench pipe:
>>
>> Pairs   Baseline Avg (s) (Std%)     Patched Avg (s) (Std%)      % Change
>> 2       2.987 (1.19%)               2.414 (17.99%)              24.06%
>> 4       7.702 (12.53%)              7.228 (18.37%)               6.16%
>> 8       14.141 (1.32%)              13.109 (1.46%)               7.29%
>> 15      27.571 (6.53%)              29.460 (8.71%)              -6.84%
>> 30      65.118 (4.49%)              61.352 (4.00%)               5.78%
>> 45      105.086 (9.75%)             97.970 (4.26%)               6.77%
>> 60      149.221 (6.91%)             154.176 (4.17%)             -3.32%
>> 75      199.278 (1.21%)             198.680 (1.37%)              0.30%
>>
>> A lot of run to run variation is seen in hackbench runs. So hard to tell
>> on the performance but looks better than schbench.
> 
> May I know if the cpu frequency was set at a fixed level and deep
> cpu idle states were disabled(I assume on power system it is called
> stop states?)

Deep cpu idle state is called 'cede' in PowerVM LPAR. I have not disabled
it.

> 
>>
>> In Power 10 and Power 11, The LLC size is relatively smaller (4 CPUs)
>> when compared to platforms like sapphire rapids and Milan. Didn't go
>> through this series yet. Will go through and try to understand why
>> schbench is not happy on Power systems.
>>
>> Meanwhile, Wanted to know your thoughts on how does smaller LLC
>> size get impacted with this patch?
>>
> 
> task aggregation on smaller LLC domain(both in terms of the
> number of CPUs and the size of LLC) might bring cache contention
> and hurt performance IMO. May I know what is the cache size on
> your system:
> lscpu | grep "L3 cache"

L3 cache: 224 MiB (56 instances)

> 
> May I know if you tested it with:
> echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features
> 
> vs
> 
> echo SCHED_CACHE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
> echo SCHED_CACHE_LB > /sys/kernel/debug/sched/features
> 

I have tested with and without this patch series. Didn't change
any sched feature. So, the patched kernel was running with the default
settings:
SCHED_CACHE, NO_SCHED_CACHE_WAKE, and SCHED_CACHE_LB.


> And could you help check if setting /sys/kernel/debug/sched/llc_aggr_cap
> from 50 to some smaller values(25, etc) would help?

Will give it a try.

Thanks,
Madadi Vineeth Reddy

> 
> thanks,
> Chenyu
> 
>> Thanks,
>> Madadi Vineeth Reddy
>>
>>
>>>
>>> With above default settings, task migrations occur less frequently
>>> and no longer happen in the latency-sensitive wake-up path.
>>>
>>
>> [..snip..]
>>
>>>
>>> Chen Yu (3):
>>>    sched: Several fixes for cache aware scheduling
>>>    sched: Avoid task migration within its preferred LLC
>>>    sched: Save the per LLC utilization for better cache aware scheduling
>>>
>>> K Prateek Nayak (1):
>>>    sched: Avoid calculating the cpumask if the system is overloaded
>>>
>>> Peter Zijlstra (1):
>>>    sched: Cache aware load-balancing
>>>
>>> Tim Chen (15):
>>>    sched: Add hysteresis to switch a task's preferred LLC
>>>    sched: Add helper function to decide whether to allow cache aware
>>>      scheduling
>>>    sched: Set up LLC indexing
>>>    sched: Introduce task preferred LLC field
>>>    sched: Calculate the number of tasks that have LLC preference on a
>>>      runqueue
>>>    sched: Introduce per runqueue task LLC preference counter
>>>    sched: Calculate the total number of preferred LLC tasks during load
>>>      balance
>>>    sched: Tag the sched group as llc_balance if it has tasks prefer other
>>>      LLC
>>>    sched: Introduce update_llc_busiest() to deal with groups having
>>>      preferred LLC tasks
>>>    sched: Introduce a new migration_type to track the preferred LLC load
>>>      balance
>>>    sched: Consider LLC locality for active balance
>>>    sched: Consider LLC preference when picking tasks from busiest queue
>>>    sched: Do not migrate task if it is moving out of its preferred LLC
>>>    sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>>>    sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>>>      up
>>>
>>>   include/linux/mm_types.h       |  44 ++
>>>   include/linux/sched.h          |   8 +
>>>   include/linux/sched/topology.h |   3 +
>>>   init/Kconfig                   |   4 +
>>>   init/init_task.c               |   3 +
>>>   kernel/fork.c                  |   5 +
>>>   kernel/sched/core.c            |  25 +-
>>>   kernel/sched/debug.c           |   4 +
>>>   kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>>>   kernel/sched/features.h        |   3 +
>>>   kernel/sched/sched.h           |  23 +
>>>   kernel/sched/topology.c        |  29 ++
>>>   12 files changed, 982 insertions(+), 28 deletions(-)
>>>
>>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-24  5:00 ` K Prateek Nayak
  2025-06-24 12:16   ` Chen, Yu C
@ 2025-06-25  0:30   ` Tim Chen
  2025-06-25  4:30     ` K Prateek Nayak
  2025-07-03 20:00   ` Shrikanth Hegde
  2 siblings, 1 reply; 68+ messages in thread
From: Tim Chen @ 2025-06-25  0:30 UTC (permalink / raw)
  To: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu

On Tue, 2025-06-24 at 10:30 +0530, K Prateek Nayak wrote:
> Hello Tim,
> 
> I have similar observation from my testing.
> 
> 
Prateek,

Thanks for the testing that you did. Much appreciated.
Some follow up to Chen, Yu's comments.

> 
> o Benchmark that prefer co-location and run in threaded mode see
>    a benefit including hackbench at high utilization and schbench
>    at low utilization.
> 
> o schbench (both new and old but particularly the old) regresses
>    quite a bit on the tial latency metric when #workers cross the
>    LLC size.

Will take closer look at the cases where #workers just exceed LLC size.
Perhaps adjusting the threshold to spread the load earlier at a
lower LLC utilization will help.

> 
> o client-server benchmarks where client and servers are threads
>    from different processes (netserver-netperf, tbench_srv-tbench,
>    services of DeathStarBench) seem to noticeably regress due to
>    lack of co-location between the communicating client and server.
> 
>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>    the preferred LLC hint.

Currently we do not aggregate tasks from different processes.
The case where client and server actually reside on the same
system I think is the exception rather than the rule for real
workloads where clients and servers reside on different systems.

But I do see tasks from different processes talking to each
other via pipe/socket in real workload.  Do you know of good
use cases for such scenario that would justify extending task
aggregation to multi-processes?
 
> 
> o stream regresses in some runs where the occupancy metrics trip
>    and assign a preferred LLC for all the stream threads bringing
>    down performance in !50% of the runs.
> 

Yes, stream does not have cache benefit from co-locating threads, and
get hurt from sharing common resource like memory controller.


> Full data from my testing is as follows:
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> 
>      ==================================================================
>      Test          : Various longer running benchmarks
>      Units         : %diff in throughput reported
>      Interpretation: Higher is better
>      Statistic     : Median
>      ==================================================================
>      Benchmarks:                  %diff
>      ycsb-cassandra              -0.99%
>      ycsb-mongodb                -0.96%
>      deathstarbench-1x           -2.09%
>      deathstarbench-2x           -0.26%
>      deathstarbench-3x           -3.34%
>      deathstarbench-6x           -3.03%
>      hammerdb+mysql 16VU         -2.15%
>      hammerdb+mysql 64VU         -3.77%
> 

The clients and server of the benchmarks are co-located on the same
system, right?

> > 
> > This patch set is applied on v6.15 kernel.
> >   
> > There are some further work needed for future versions in this
> > patch set.  We will need to align NUMA balancing with LLC aggregations
> > such that LLC aggregation will align with the preferred NUMA node.
> > 
> > Comments and tests are much appreciated.
> 
> I'll rerun the test once with the SCHED_FEAT() disabled just to make
> sure I'm not regressing because of some other factors. For the major
> regressions, I'll get the "perf sched stats" data to see if anything
> stands out.
> 
> I'm also planning on getting the data from a Zen5c system with larger
> LLC to see if there is any difference in the trend (I'll start with the
> microbenchmarks since setting the larger ones will take some time)
> 
> Sorry for the lack of engagement on previous versions but I plan on
> taking a better look at the series this time around. If you need any
> specific data from my setup, please do let me know.
> 

Will do.  Thanks.

Tim

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-24 12:16   ` Chen, Yu C
@ 2025-06-25  4:19     ` K Prateek Nayak
  0 siblings, 0 replies; 68+ messages in thread
From: K Prateek Nayak @ 2025-06-25  4:19 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Tim Chen, Peter Zijlstra, Gautham R . Shenoy,
	Ingo Molnar

Hello Chenyu,

On 6/24/2025 5:46 PM, Chen, Yu C wrote:

[..snip..]

>> tl;dr
>>
>> o Benchmark that prefer co-location and run in threaded mode see
>>    a benefit including hackbench at high utilization and schbench
>>    at low utilization.
>>
> 
> Previously, we tested hackbench with one group using different
> fd pairs. The number of fds (1–6) was lower than the number
> of CPUs (8) within one CCX. If I understand correctly, the
> default number of fd pairs in hackbench is 20.

Yes that is correct. I'm using the default configuration with
20 messengers are 20 receivers over 20 fd pairs. I'll check
if changing this to nr_llc and nr_llc / 2 makes a difference.

> We might need
> to handle cases where the number of threads (nr_thread)
> exceeds the number of CPUs per LLC—perhaps by
> skipping task aggregation in such scenarios.
> 
>> o schbench (both new and old but particularly the old) regresses
>>    quite a bit on the tial latency metric when #workers cross the
>>    LLC size.
>>
> 
> As mentioned above, maybe re-consider the nr_thread vs nr_cpus_per_llc
> could mitigate the issue. Besides, maybe introduce a rate limit
> for cache aware aggregation would help.
> 
>> o client-server benchmarks where client and servers are threads
>>    from different processes (netserver-netperf, tbench_srv-tbench,
>>    services of DeathStarBench) seem to noticeably regress due to
>>    lack of co-location between the communicating client and server.
>>
>>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>>    the preferred LLC hint.
> 
> WF_SYNC is used in wakeup path, the current v3 version does the
> task aggregation in the load balance path. We'll look into this
> C/S scenario.
> 
>>
>> o stream regresses in some runs where the occupancy metrics trip
>>    and assign a preferred LLC for all the stream threads bringing
>>    down performance in !50% of the runs.
>>
> 
> May I know if you tested the stream with mmtests under OMP mode,
> and what do stream-10 and stream-100 mean?

I'm using STREAM in OMP mode. The "10" and "100" refer to the
"NTIMES" argument. I'm passing this during the time of binary
creation as:

     gcc -DSTREAM_ARRAY_SIZE=$ARRAY_SIZE -DNTIMES=$NUM_TIMES -fopenmp -O2 stream.c -o stream

This repeats the main loop of stream benchmark NTIMES. 10 runs
is used to spot any imbalances for shorter runs of b/w intensive
tasks and 100 runs are used to spot trends / ability to correct
an incorrect placement over a longer run.

> Stream is an example
> where all threads have their private memory buffers—no
> interaction with each other. For this benchmark, spreading
> them across different Nodes gets higher memory bandwidth because
> stream allocates the buffer to be at least 4X the L3 cache size.
> We lack a metric that can indicate when threads share a lot of
> data (e.g., both Thread 1 and Thread 2 read from the same
> buffer). In such cases, we should aggregate the threads;
> otherwise, do not aggregate them (as in the stream case).
> On the other hand, stream-omp seems like an unrealistic
> scenario—if threads do not share buffer, why create them
> in the same process?

Not very sure why that is the case but from what I know, HPC
heavily relies on OMP and I believe using threads can reduce
the overhead of fork + join when amount of parallelism in
OMP loops vary?

[..snip..]

>>
>>>
>>> This patch set is applied on v6.15 kernel.
>>> There are some further work needed for future versions in this
>>> patch set.  We will need to align NUMA balancing with LLC aggregations
>>> such that LLC aggregation will align with the preferred NUMA node.
>>>
>>> Comments and tests are much appreciated.
>>
>> I'll rerun the test once with the SCHED_FEAT() disabled just to make
>> sure I'm not regressing because of some other factors. For the major
>> regressions, I'll get the "perf sched stats" data to see if anything
>> stands out.
> 
> It seems that task migration and task bouncing between its preferred
> LLC and non-preferred LLC is one symptom that caused regression.

That could be the case! I'll also include some migration data to see
if it reveals anything.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-25  0:30   ` Tim Chen
@ 2025-06-25  4:30     ` K Prateek Nayak
  0 siblings, 0 replies; 68+ messages in thread
From: K Prateek Nayak @ 2025-06-25  4:30 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu

Hello Tim,

On 6/25/2025 6:00 AM, Tim Chen wrote:
>> o Benchmark that prefer co-location and run in threaded mode see
>>     a benefit including hackbench at high utilization and schbench
>>     at low utilization.
>>
>> o schbench (both new and old but particularly the old) regresses
>>     quite a bit on the tial latency metric when #workers cross the
>>     LLC size.
> 
> Will take closer look at the cases where #workers just exceed LLC size.
> Perhaps adjusting the threshold to spread the load earlier at a
> lower LLC utilization will help.

I too will test with different number of fd pairs to see if I can
spot a trend.

> 
>>
>> o client-server benchmarks where client and servers are threads
>>     from different processes (netserver-netperf, tbench_srv-tbench,
>>     services of DeathStarBench) seem to noticeably regress due to
>>     lack of co-location between the communicating client and server.
>>
>>     Not sure if WF_SYNC can be an indicator to temporarily ignore
>>     the preferred LLC hint.
> 
> Currently we do not aggregate tasks from different processes.
> The case where client and server actually reside on the same
> system I think is the exception rather than the rule for real
> workloads where clients and servers reside on different systems.
> 
> But I do see tasks from different processes talking to each
> other via pipe/socket in real workload.  Do you know of good
> use cases for such scenario that would justify extending task
> aggregation to multi-processes?

We've seen cases with Kubernetes deployments where co-locating
processes of different services from the same pod can help with
throughput and latency. Perhaps it can happen indirectly where
co-location on WF_SYNC can actually help increase the cache
occupancy for a the other process and they both arrive at the
same preferred LLC. I'll see if I can get my hands on a setup
which is closer to these real world deployment.

>   
>>
>> o stream regresses in some runs where the occupancy metrics trip
>>     and assign a preferred LLC for all the stream threads bringing
>>     down performance in !50% of the runs.
>>
> 
> Yes, stream does not have cache benefit from co-locating threads, and
> get hurt from sharing common resource like memory controller.
> 
> 
>> Full data from my testing is as follows:
>>
>> o Machine details
>>
>> - 3rd Generation EPYC System
>> - 2 sockets each with 64C/128T
>> - NPS1 (Each socket is a NUMA node)
>> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>>
>>
>>       ==================================================================
>>       Test          : Various longer running benchmarks
>>       Units         : %diff in throughput reported
>>       Interpretation: Higher is better
>>       Statistic     : Median
>>       ==================================================================
>>       Benchmarks:                  %diff
>>       ycsb-cassandra              -0.99%
>>       ycsb-mongodb                -0.96%
>>       deathstarbench-1x           -2.09%
>>       deathstarbench-2x           -0.26%
>>       deathstarbench-3x           -3.34%
>>       deathstarbench-6x           -3.03%
>>       hammerdb+mysql 16VU         -2.15%
>>       hammerdb+mysql 64VU         -3.77%
>>
> 
> The clients and server of the benchmarks are co-located on the same
> system, right?

Yes that is correct. I'm using a 2P systems and our runner scripts
pin the workload to the first socket, and the workload driver runs
from the second socket. One side effect of this is that changes can
influence the placement of workload driver and that can lead to
some inconsistencies. I'll check if the the stats for the workload
driver is way off between the baseline and with this series.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-06-18 18:27 ` [RFC patch v3 01/20] sched: Cache aware load-balancing Tim Chen
@ 2025-06-26 12:23   ` Jianyong Wu
  2025-06-26 13:32     ` Chen, Yu C
  2025-07-03 19:29   ` Shrikanth Hegde
  1 sibling, 1 reply; 68+ messages in thread
From: Jianyong Wu @ 2025-06-26 12:23 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu

Hi Tim,

On 6/19/2025 2:27 AM, Tim Chen wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> Hi all,
> 
> One of the many things on the eternal todo list has been finishing the
> below hackery.
> 
> It is an attempt at modelling cache affinity -- and while the patch
> really only targets LLC, it could very well be extended to also apply to
> clusters (L2). Specifically any case of multiple cache domains inside a
> node.
> 
> Anyway, I wrote this about a year ago, and I mentioned this at the
> recent OSPM conf where Gautham and Prateek expressed interest in playing
> with this code.
> 
> So here goes, very rough and largely unproven code ahead :-)
> 
> It applies to current tip/master, but I know it will fail the __percpu
> validation that sits in -next, although that shouldn't be terribly hard
> to fix up.
> 
> As is, it only computes a CPU inside the LLC that has the highest recent
> runtime, this CPU is then used in the wake-up path to steer towards this
> LLC and in task_hot() to limit migrations away from it.
> 
> More elaborate things could be done, notably there is an XXX in there
> somewhere about finding the best LLC inside a NODE (interaction with
> NUMA_BALANCING).
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   include/linux/mm_types.h |  44 ++++++
>   include/linux/sched.h    |   4 +
>   init/Kconfig             |   4 +
>   kernel/fork.c            |   5 +
>   kernel/sched/core.c      |  13 +-
>   kernel/sched/fair.c      | 330 +++++++++++++++++++++++++++++++++++++--
>   kernel/sched/sched.h     |   8 +
>   7 files changed, 388 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 56d07edd01f9..013291c6aaa2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -893,6 +893,12 @@ struct mm_cid {
>   };
>   #endif
>   

> +static void task_cache_work(struct callback_head *work)
> +{
> +	struct task_struct *p = current;
> +	struct mm_struct *mm = p->mm;
> +	unsigned long m_a_occ = 0;
> +	int cpu, m_a_cpu = -1;
> +	cpumask_var_t cpus;
> +
> +	WARN_ON_ONCE(work != &p->cache_work);
> +
> +	work->next = work;
> +
> +	if (p->flags & PF_EXITING)
> +		return;
> +
> +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> +		return;
> +
> +	scoped_guard (cpus_read_lock) {
> +		cpumask_copy(cpus, cpu_online_mask);
> +
> +		for_each_cpu(cpu, cpus) {
> +			/* XXX sched_cluster_active */
> +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
> +			unsigned long occ, m_occ = 0, a_occ = 0;
> +			int m_cpu = -1, nr = 0, i;
> +
> +			for_each_cpu(i, sched_domain_span(sd)) {
> +				occ = fraction_mm_sched(cpu_rq(i),
> +							per_cpu_ptr(mm->pcpu_sched, i));
> +				a_occ += occ;
> +				if (occ > m_occ) {
> +					m_occ = occ;
> +					m_cpu = i;
> +				}
> +				nr++;
> +				trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
> +					     per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
> +			}
> +
> +			a_occ /= nr;
> +			if (a_occ > m_a_occ) {
> +				m_a_occ = a_occ;
> +				m_a_cpu = m_cpu;
> +			}
> +
> +			trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
> +				     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
> +
> +			for_each_cpu(i, sched_domain_span(sd)) {
> +				/* XXX threshold ? */
> +				per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
> +			}
> +
> +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
> +		}
> +	}
> +
> +	/*
> +	 * If the max average cache occupancy is 'small' we don't care.
> +	 */
> +	if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
> +		m_a_cpu = -1;
> +
> +	mm->mm_sched_cpu = m_a_cpu;
> +
> +	free_cpumask_var(cpus);
> +}
> +

This task work may take a long time for the system with large number 
cpus which increacing the delay for process back to userspace. It may be 
the reason that schbench benchmark regressed so much.

To avoid searching the whole system, what about just searching the 
preferred numa node provided by numa balancing if there is one. If not, 
then fallback to search the whole system or just search the numa node 
where the main process locates as there is a high probability it 
contains the preferred LLC. In other words, we can opt for a suboptimal 
LLC location to prioritize speed.

WDYT?

Thanks
Jianyong

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-06-26 12:23   ` Jianyong Wu
@ 2025-06-26 13:32     ` Chen, Yu C
  2025-06-27  0:10       ` Tim Chen
  0 siblings, 1 reply; 68+ messages in thread
From: Chen, Yu C @ 2025-06-26 13:32 UTC (permalink / raw)
  To: Jianyong Wu
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Gautham R . Shenoy, Ingo Molnar, Tim Chen,
	K Prateek Nayak, Peter Zijlstra

On 6/26/2025 8:23 PM, Jianyong Wu wrote:
> Hi Tim,
> 
> On 6/19/2025 2:27 AM, Tim Chen wrote:
>> From: Peter Zijlstra <peterz@infradead.org>
>>
>> Hi all,
>>
>> One of the many things on the eternal todo list has been finishing the
>> below hackery.
>>
>> It is an attempt at modelling cache affinity -- and while the patch
>> really only targets LLC, it could very well be extended to also apply to
>> clusters (L2). Specifically any case of multiple cache domains inside a
>> node.
>>
>> Anyway, I wrote this about a year ago, and I mentioned this at the
>> recent OSPM conf where Gautham and Prateek expressed interest in playing
>> with this code.
>>
>> So here goes, very rough and largely unproven code ahead :-)
>>
>> It applies to current tip/master, but I know it will fail the __percpu
>> validation that sits in -next, although that shouldn't be terribly hard
>> to fix up.
>>
>> As is, it only computes a CPU inside the LLC that has the highest recent
>> runtime, this CPU is then used in the wake-up path to steer towards this
>> LLC and in task_hot() to limit migrations away from it.
>>
>> More elaborate things could be done, notably there is an XXX in there
>> somewhere about finding the best LLC inside a NODE (interaction with
>> NUMA_BALANCING).
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> ---
>>   include/linux/mm_types.h |  44 ++++++
>>   include/linux/sched.h    |   4 +
>>   init/Kconfig             |   4 +
>>   kernel/fork.c            |   5 +
>>   kernel/sched/core.c      |  13 +-
>>   kernel/sched/fair.c      | 330 +++++++++++++++++++++++++++++++++++++--
>>   kernel/sched/sched.h     |   8 +
>>   7 files changed, 388 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 56d07edd01f9..013291c6aaa2 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -893,6 +893,12 @@ struct mm_cid {
>>   };
>>   #endif
> 
>> +static void task_cache_work(struct callback_head *work)
>> +{
>> +    struct task_struct *p = current;
>> +    struct mm_struct *mm = p->mm;
>> +    unsigned long m_a_occ = 0;
>> +    int cpu, m_a_cpu = -1;
>> +    cpumask_var_t cpus;
>> +
>> +    WARN_ON_ONCE(work != &p->cache_work);
>> +
>> +    work->next = work;
>> +
>> +    if (p->flags & PF_EXITING)
>> +        return;
>> +
>> +    if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>> +        return;
>> +
>> +    scoped_guard (cpus_read_lock) {
>> +        cpumask_copy(cpus, cpu_online_mask);
>> +
>> +        for_each_cpu(cpu, cpus) {
>> +            /* XXX sched_cluster_active */
>> +            struct sched_domain *sd = per_cpu(sd_llc, cpu);
>> +            unsigned long occ, m_occ = 0, a_occ = 0;
>> +            int m_cpu = -1, nr = 0, i;
>> +
>> +            for_each_cpu(i, sched_domain_span(sd)) {
>> +                occ = fraction_mm_sched(cpu_rq(i),
>> +                            per_cpu_ptr(mm->pcpu_sched, i));
>> +                a_occ += occ;
>> +                if (occ > m_occ) {
>> +                    m_occ = occ;
>> +                    m_cpu = i;
>> +                }
>> +                nr++;
>> +                trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: 
>> %d\n",
>> +                         per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
>> +            }
>> +
>> +            a_occ /= nr;
>> +            if (a_occ > m_a_occ) {
>> +                m_a_occ = a_occ;
>> +                m_a_cpu = m_cpu;
>> +            }
>> +
>> +            trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
>> +                     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
>> +
>> +            for_each_cpu(i, sched_domain_span(sd)) {
>> +                /* XXX threshold ? */
>> +                per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
>> +            }
>> +
>> +            cpumask_andnot(cpus, cpus, sched_domain_span(sd));
>> +        }
>> +    }
>> +
>> +    /*
>> +     * If the max average cache occupancy is 'small' we don't care.
>> +     */
>> +    if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
>> +        m_a_cpu = -1;
>> +
>> +    mm->mm_sched_cpu = m_a_cpu;
>> +
>> +    free_cpumask_var(cpus);
>> +}
>> +
> 
> This task work may take a long time for the system with large number 
> cpus which increacing the delay for process back to userspace. It may be 
> the reason that schbench benchmark regressed so much.
> 

Thanks for the insight Jianyong, yes, the scan on all online CPUs would
be costly.

> To avoid searching the whole system, what about just searching the 
> preferred numa node provided by numa balancing if there is one. If not, 
> then fallback to search the whole system or just search the numa node 
> where the main process locates as there is a high probability it 
> contains the preferred LLC. In other words, we can opt for a suboptimal 
> LLC location to prioritize speed.
> 
> WDYT?
> 

This is a good idea. Previously, Tim had a version that dealt with a
similar scenario, which only scanned the CPUs within p's preferred node.
  However, it seems to cause bouncing of the mm->mm_sched_cpu because we
set a 2X threshold for switching the mm->mm_sched_cpu in patch 5. If the
old mm_sched_cpu is not in p's current preferred node, last_m_a_occ is
always 0, which makes the switching of mm->mm_sched_cpu always succeed
due to the condition if (m_a_occ > (2 * last_m_a_occ)). Anyway, since it
is a software issue, we can find a way to address it.

Maybe we also following Abel's suggestion that only one thread of
the process is allowed to perform the statistic calculation, this
could minimal the negative impact to the whole process.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-06-26 13:32     ` Chen, Yu C
@ 2025-06-27  0:10       ` Tim Chen
  2025-06-27  2:13         ` Jianyong Wu
  0 siblings, 1 reply; 68+ messages in thread
From: Tim Chen @ 2025-06-27  0:10 UTC (permalink / raw)
  To: Chen, Yu C, Jianyong Wu
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Gautham R . Shenoy, Ingo Molnar, K Prateek Nayak,
	Peter Zijlstra

On Thu, 2025-06-26 at 21:32 +0800, Chen, Yu C wrote:
> 
> > 
> > This task work may take a long time for the system with large number 
> > cpus which increacing the delay for process back to userspace. It may be 
> > the reason that schbench benchmark regressed so much.
> > 
> 
> Thanks for the insight Jianyong, yes, the scan on all online CPUs would
> be costly.
> 
> > To avoid searching the whole system, what about just searching the 
> > preferred numa node provided by numa balancing if there is one. If not, 
> > then fallback to search the whole system or just search the numa node 
> > where the main process locates as there is a high probability it 
> > contains the preferred LLC. In other words, we can opt for a suboptimal 
> > LLC location to prioritize speed.
> > 
> > WDYT?
> > 
> This is a good idea. Previously, Tim had a version that dealt with a
> similar scenario, which only scanned the CPUs within p's preferred node.

Yes, we were also thinking along the line of looking only at the preferred
node.

>   However, it seems to cause bouncing of the mm->mm_sched_cpu because we
> set a 2X threshold for switching the mm->mm_sched_cpu in patch 5. If the
> old mm_sched_cpu is not in p's current preferred node, last_m_a_occ is
> always 0, which makes the switching of mm->mm_sched_cpu always succeed
> due to the condition if (m_a_occ > (2 * last_m_a_occ)). 
> 
There were some regressions on schbench during out tests and preferred
LLC bounces switches a lot with preferred node as mentioned by
Chen Yu.  For schbench, there's really not much NUMA data and preferred
node bounces around. We'll have to figure out the right thing
to do if preferred node changes and preferred LLC falls outside the
preferred node.

Tim

> Anyway, since it
> is a software issue, we can find a way to address it.
> 
> Maybe we also following Abel's suggestion that only one thread of
> the process is allowed to perform the statistic calculation, this
> could minimal the negative impact to the whole process.
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-06-27  0:10       ` Tim Chen
@ 2025-06-27  2:13         ` Jianyong Wu
  0 siblings, 0 replies; 68+ messages in thread
From: Jianyong Wu @ 2025-06-27  2:13 UTC (permalink / raw)
  To: Tim Chen, Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Gautham R . Shenoy, Ingo Molnar, K Prateek Nayak,
	Peter Zijlstra

Hi Tim, Chen,

On 6/27/2025 8:10 AM, Tim Chen wrote:
> On Thu, 2025-06-26 at 21:32 +0800, Chen, Yu C wrote:
>>
>>>
>>> This task work may take a long time for the system with large number
>>> cpus which increacing the delay for process back to userspace. It may be
>>> the reason that schbench benchmark regressed so much.
>>>
>>
>> Thanks for the insight Jianyong, yes, the scan on all online CPUs would
>> be costly.
>>
>>> To avoid searching the whole system, what about just searching the
>>> preferred numa node provided by numa balancing if there is one. If not,
>>> then fallback to search the whole system or just search the numa node
>>> where the main process locates as there is a high probability it
>>> contains the preferred LLC. In other words, we can opt for a suboptimal
>>> LLC location to prioritize speed.
>>>
>>> WDYT?
>>>
>> This is a good idea. Previously, Tim had a version that dealt with a
>> similar scenario, which only scanned the CPUs within p's preferred node.
> 
> Yes, we were also thinking along the line of looking only at the preferred
> node.
> 
>>    However, it seems to cause bouncing of the mm->mm_sched_cpu because we
>> set a 2X threshold for switching the mm->mm_sched_cpu in patch 5. If the
>> old mm_sched_cpu is not in p's current preferred node, last_m_a_occ is
>> always 0, which makes the switching of mm->mm_sched_cpu always succeed
>> due to the condition if (m_a_occ > (2 * last_m_a_occ)).
>>
> There were some regressions on schbench during out tests and preferred
> LLC bounces switches a lot with preferred node as mentioned by
> Chen Yu.  For schbench, there's really not much NUMA data and preferred
> node bounces around. We'll have to figure out the right thing
> to do if preferred node changes and preferred LLC falls outside the
> preferred node.
> 
> Tim
> 
>> Anyway, since it
>> is a software issue, we can find a way to address it.
>>
>> Maybe we also following Abel's suggestion that only one thread of
>> the process is allowed to perform the statistic calculation, this
>> could minimal the negative impact to the whole process.
>>
> 
> 
Thanks for explanation. Get it.

Thanks
Jianyong


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 05/20] sched: Add hysteresis to switch a task's preferred LLC
  2025-06-18 18:27 ` [RFC patch v3 05/20] sched: Add hysteresis to switch a task's preferred LLC Tim Chen
@ 2025-07-02  6:47   ` Madadi Vineeth Reddy
  2025-07-02 21:47     ` Tim Chen
  0 siblings, 1 reply; 68+ messages in thread
From: Madadi Vineeth Reddy @ 2025-07-02  6:47 UTC (permalink / raw)
  To: Tim Chen
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, Len Brown, linux-kernel,
	Chen Yu, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Madadi Vineeth Reddy

Hi Tim,

On 18/06/25 23:57, Tim Chen wrote:
> Switching a process's preferred LLC generates lots of task
> migrations across LLCs. To avoid frequent switches
> of home LLC, implement the following policy:
> 
> 1. Require a 2x occ change threshold to switch preferred LLC
> 2. Don't discard preferred LLC for a task
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 24 ++++++++++++++++--------
>  1 file changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a2678f9d44a..7fb2322c5d9e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1175,6 +1175,14 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
>  #define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
>  #define EPOCH_OLD	5		/* 50 ms */
>  
> +static int llc_id(int cpu)
> +{
> +	if (cpu < 0)
> +		return -1;
> +
> +	return per_cpu(sd_llc_id, cpu);
> +}
> +
>  void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
>  {
>  	unsigned long epoch;
> @@ -1299,6 +1307,7 @@ static void task_cache_work(struct callback_head *work)
>  	struct task_struct *p = current;
>  	struct mm_struct *mm = p->mm;
>  	unsigned long m_a_occ = 0;
> +	unsigned long last_m_a_occ = 0;
>  	int cpu, m_a_cpu = -1;
>  	cpumask_var_t cpus;
>  
> @@ -1337,11 +1346,13 @@ static void task_cache_work(struct callback_head *work)
>  					     per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
>  			}
>  
> -			a_occ /= nr;
> +			// a_occ /= nr;

Is the above by mistake?
I think we need to have average only and not the total value as that favors LLCs with
larger size.

Thanks,
Madadi Vineeth Reddy

>  			if (a_occ > m_a_occ) {
>  				m_a_occ = a_occ;
>  				m_a_cpu = m_cpu;
>  			}
> +			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
> +				last_m_a_occ = a_occ;
>  
>  			trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
>  				     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
> @@ -1355,13 +1366,10 @@ static void task_cache_work(struct callback_head *work)
>  		}
>  	}
>  
> -	/*
> -	 * If the max average cache occupancy is 'small' we don't care.
> -	 */
> -	if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
> -		m_a_cpu = -1;
> -
> -	mm->mm_sched_cpu = m_a_cpu;
> +	if (m_a_occ > (2 * last_m_a_occ)) {
> +		/* avoid the bouncing of mm_sched_cpu */
> +		mm->mm_sched_cpu = m_a_cpu;
> +	}
>  
>  	free_cpumask_var(cpus);
>  }


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 05/20] sched: Add hysteresis to switch a task's preferred LLC
  2025-07-02  6:47   ` Madadi Vineeth Reddy
@ 2025-07-02 21:47     ` Tim Chen
  0 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-07-02 21:47 UTC (permalink / raw)
  To: Madadi Vineeth Reddy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, Len Brown, linux-kernel,
	Chen Yu, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy

On Wed, 2025-07-02 at 12:17 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
> 
> On 18/06/25 23:57, Tim Chen wrote:
> > Switching a process's preferred LLC generates lots of task
> > migrations across LLCs. To avoid frequent switches
> > of home LLC, implement the following policy:
> > 
> > 1. Require a 2x occ change threshold to switch preferred LLC
> > 2. Don't discard preferred LLC for a task
> > 
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> >  kernel/sched/fair.c | 24 ++++++++++++++++--------
> >  1 file changed, 16 insertions(+), 8 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6a2678f9d44a..7fb2322c5d9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1175,6 +1175,14 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
> >  #define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
> >  #define EPOCH_OLD	5		/* 50 ms */
> >  
> > +static int llc_id(int cpu)
> > +{
> > +	if (cpu < 0)
> > +		return -1;
> > +
> > +	return per_cpu(sd_llc_id, cpu);
> > +}
> > +
> >  void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
> >  {
> >  	unsigned long epoch;
> > @@ -1299,6 +1307,7 @@ static void task_cache_work(struct callback_head *work)
> >  	struct task_struct *p = current;
> >  	struct mm_struct *mm = p->mm;
> >  	unsigned long m_a_occ = 0;
> > +	unsigned long last_m_a_occ = 0;
> >  	int cpu, m_a_cpu = -1;
> >  	cpumask_var_t cpus;
> >  
> > @@ -1337,11 +1346,13 @@ static void task_cache_work(struct callback_head *work)
> >  					     per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
> >  			}
> >  
> > -			a_occ /= nr;
> > +			// a_occ /= nr;
> 
> Is the above by mistake?
> I think we need to have average only and not the total value as that favors LLCs with
> larger size.
> 

Actually Chen Yu and I have gone back and forth on this one.  A
different perspective is dividing by nr will disfavor
LLCs of larger size.  You will need way more tasks in larger
LLC to put the tasks there, which may cause over-stacking on the
smaller LLC. We find that not dividing by nr is more stable when
we bring CPU online/offline.  

Tim

> Thanks,
> Madadi Vineeth Reddy
> 
> >  			if (a_occ > m_a_occ) {
> >  				m_a_occ = a_occ;
> >  				m_a_cpu = m_cpu;
> >  			}
> > +			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
> > +				last_m_a_occ = a_occ;
> >  
> >  			trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
> >  				     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
> > @@ -1355,13 +1366,10 @@ static void task_cache_work(struct callback_head *work)
> >  		}
> >  	}
> >  
> > -	/*
> > -	 * If the max average cache occupancy is 'small' we don't care.
> > -	 */
> > -	if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
> > -		m_a_cpu = -1;
> > -
> > -	mm->mm_sched_cpu = m_a_cpu;
> > +	if (m_a_occ > (2 * last_m_a_occ)) {
> > +		/* avoid the bouncing of mm_sched_cpu */
> > +		mm->mm_sched_cpu = m_a_cpu;
> > +	}
> >  
> >  	free_cpumask_var(cpus);
> >  }
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-06-18 18:27 ` [RFC patch v3 01/20] sched: Cache aware load-balancing Tim Chen
  2025-06-26 12:23   ` Jianyong Wu
@ 2025-07-03 19:29   ` Shrikanth Hegde
  2025-07-04  8:40     ` Chen, Yu C
  2025-07-07 19:57     ` Tim Chen
  1 sibling, 2 replies; 68+ messages in thread
From: Shrikanth Hegde @ 2025-07-03 19:29 UTC (permalink / raw)
  To: Tim Chen, Chen Yu
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy



Hi Tim, Chen,
skimming through the series and will try to go through in coming days.

> 
> One of the many things on the eternal todo list has been finishing the
> below hackery.
> 
> It is an attempt at modelling cache affinity -- and while the patch
> really only targets LLC, it could very well be extended to also apply to
> clusters (L2). Specifically any case of multiple cache domains inside a
> node.
> 
> Anyway, I wrote this about a year ago, and I mentioned this at the
> recent OSPM conf where Gautham and Prateek expressed interest in playing
> with this code.
> 
> So here goes, very rough and largely unproven code ahead :-)
> 
> It applies to current tip/master, but I know it will fail the __percpu
> validation that sits in -next, although that shouldn't be terribly hard
> to fix up.
> 
> As is, it only computes a CPU inside the LLC that has the highest recent
> runtime, this CPU is then used in the wake-up path to steer towards this
> LLC and in task_hot() to limit migrations away from it.
> 
> More elaborate things could be done, notably there is an XXX in there
> somewhere about finding the best LLC inside a NODE (interaction with
> NUMA_BALANCING).
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>   include/linux/mm_types.h |  44 ++++++
>   include/linux/sched.h    |   4 +
>   init/Kconfig             |   4 +
>   kernel/fork.c            |   5 +
>   kernel/sched/core.c      |  13 +-
>   kernel/sched/fair.c      | 330 +++++++++++++++++++++++++++++++++++++--
>   kernel/sched/sched.h     |   8 +
>   7 files changed, 388 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 56d07edd01f9..013291c6aaa2 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -893,6 +893,12 @@ struct mm_cid {
>   };
>   #endif
>   
> +struct mm_sched {
> +	u64 runtime;
> +	unsigned long epoch;
> +	unsigned long occ;
> +};
> +
>   struct kioctx_table;
>   struct iommu_mm_data;
>   struct mm_struct {
> @@ -983,6 +989,17 @@ struct mm_struct {
>   		 */
>   		raw_spinlock_t cpus_allowed_lock;
>   #endif
> +#ifdef CONFIG_SCHED_CACHE
> +		/*
> +		 * Track per-cpu-per-process occupancy as a proxy for cache residency.
> +		 * See account_mm_sched() and ...
> +		 */
> +		struct mm_sched __percpu *pcpu_sched;
> +		raw_spinlock_t mm_sched_lock;
> +		unsigned long mm_sched_epoch;
> +		int mm_sched_cpu;
> +#endif
> +
>   #ifdef CONFIG_MMU
>   		atomic_long_t pgtables_bytes;	/* size of all page tables */
>   #endif
> @@ -1393,6 +1410,33 @@ static inline unsigned int mm_cid_size(void)
>   static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
>   #endif /* CONFIG_SCHED_MM_CID */
>   
> +#ifdef CONFIG_SCHED_CACHE
> +extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sched);
> +
> +static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
> +{
> +	struct mm_sched *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
> +	if (!pcpu_sched)
> +		return -ENOMEM;
> +
> +	mm_init_sched(mm, pcpu_sched);
> +	return 0;
> +}
> +
> +#define mm_alloc_sched(...)	alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
> +
> +static inline void mm_destroy_sched(struct mm_struct *mm)
> +{
> +	free_percpu(mm->pcpu_sched);
> +	mm->pcpu_sched = NULL;
> +}
> +#else /* !CONFIG_SCHED_CACHE */
> +
> +static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
> +static inline void mm_destroy_sched(struct mm_struct *mm) { }
> +
> +#endif /* CONFIG_SCHED_CACHE */
> +
>   struct mmu_gather;
>   extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
>   extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..d0e4cda2b3cd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1399,6 +1399,10 @@ struct task_struct {
>   	unsigned long			numa_pages_migrated;
>   #endif /* CONFIG_NUMA_BALANCING */
>   
> +#ifdef CONFIG_SCHED_CACHE
> +	struct callback_head		cache_work;
> +#endif
> +
>   #ifdef CONFIG_RSEQ
>   	struct rseq __user *rseq;
>   	u32 rseq_len;
> diff --git a/init/Kconfig b/init/Kconfig
> index bf3a920064be..e2509127b6f9 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -953,6 +953,10 @@ config NUMA_BALANCING
>   
>   	  This system will be inactive on UMA systems.
>   
> +config SCHED_CACHE
> +	bool "Cache aware scheduler"
> +	default y
> +

Should it depend on EXPERT?
IMO this could add quite a bit of overhead and maybe n by default?

>   config NUMA_BALANCING_DEFAULT_ENABLED
>   	bool "Automatically enable NUMA aware memory/task placement"
>   	default y
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 168681fc4b25..da1387823b9e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1332,6 +1332,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>   	if (mm_alloc_cid(mm, p))
>   		goto fail_cid;
>   
> +	if (mm_alloc_sched(mm))
> +		goto fail_sched;
> +
>   	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
>   				     NR_MM_COUNTERS))
>   		goto fail_pcpu;
> @@ -1341,6 +1344,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>   	return mm;
>   
>   fail_pcpu:
> +	mm_destroy_sched(mm);
> +fail_sched:
>   	mm_destroy_cid(mm);
>   fail_cid:
>   	destroy_context(mm);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c81cf642dba0..d9c3e75f79d1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4524,6 +4524,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>   	p->migration_pending = NULL;
>   #endif
>   	init_sched_mm_cid(p);
> +	init_sched_mm(p);
>   }
>   
>   DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> @@ -8526,6 +8527,7 @@ static struct kmem_cache *task_group_cache __ro_after_init;
>   
>   void __init sched_init(void)
>   {
> +	unsigned long now = jiffies;
>   	unsigned long ptr = 0;
>   	int i;
>   
> @@ -8600,7 +8602,7 @@ void __init sched_init(void)
>   		raw_spin_lock_init(&rq->__lock);
>   		rq->nr_running = 0;
>   		rq->calc_load_active = 0;
> -		rq->calc_load_update = jiffies + LOAD_FREQ;
> +		rq->calc_load_update = now + LOAD_FREQ;
>   		init_cfs_rq(&rq->cfs);
>   		init_rt_rq(&rq->rt);
>   		init_dl_rq(&rq->dl);
> @@ -8644,7 +8646,7 @@ void __init sched_init(void)
>   		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
>   		rq->balance_callback = &balance_push_callback;
>   		rq->active_balance = 0;
> -		rq->next_balance = jiffies;
> +		rq->next_balance = now;
>   		rq->push_cpu = 0;
>   		rq->cpu = i;
>   		rq->online = 0;
> @@ -8656,7 +8658,7 @@ void __init sched_init(void)
>   
>   		rq_attach_root(rq, &def_root_domain);
>   #ifdef CONFIG_NO_HZ_COMMON
> -		rq->last_blocked_load_update_tick = jiffies;
> +		rq->last_blocked_load_update_tick = now;
>   		atomic_set(&rq->nohz_flags, 0);
>   
>   		INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
> @@ -8681,6 +8683,11 @@ void __init sched_init(void)
>   
>   		rq->core_cookie = 0UL;
>   #endif
> +#ifdef CONFIG_SCHED_CACHE
> +		raw_spin_lock_init(&rq->cpu_epoch_lock);
> +		rq->cpu_epoch_next = now;
> +#endif
> +
>   		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
>   	}
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995a47..df7d4a324fbe 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1166,10 +1166,229 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
>   	return delta_exec;
>   }
>   
> -static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
> +#ifdef CONFIG_SCHED_CACHE
> +
> +/*
> + * XXX numbers come from a place the sun don't shine -- probably wants to be SD
> + * tunable or so.
> + */
> +#define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
> +#define EPOCH_OLD	5		/* 50 ms */

Have these been converted into tunables? I didn't spot that in the series.

> +
> +void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
> +{
> +	unsigned long epoch;
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
> +		struct rq *rq = cpu_rq(i);
> +
> +		pcpu_sched->runtime = 0;
> +		pcpu_sched->epoch = epoch = rq->cpu_epoch;
> +		pcpu_sched->occ = -1;
> +	}
> +
> +	raw_spin_lock_init(&mm->mm_sched_lock);
> +	mm->mm_sched_epoch = epoch;
> +	mm->mm_sched_cpu = -1;
> +
> +	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
> +}
> +
> +/* because why would C be fully specified */
> +static __always_inline void __shr_u64(u64 *val, unsigned int n)
> +{
> +	if (n >= 64) {
> +		*val = 0;
> +		return;
> +	}
> +	*val >>= n;
> +}
> +
> +static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
> +{
> +	lockdep_assert_held(&rq->cpu_epoch_lock);
> +
> +	unsigned long n, now = jiffies;
> +	long delta = now - rq->cpu_epoch_next;
> +
> +	if (delta > 0) {
> +		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
> +		rq->cpu_epoch += n;
> +		rq->cpu_epoch_next += n * EPOCH_PERIOD;
> +		__shr_u64(&rq->cpu_runtime, n);

Another doubt i had, does this occupancy works when there is CPU bandwidth controller running?
A 50% occupancy may have different meaning when CPU bandwidth is set to 50%?

> +	}
> +
> +	n = rq->cpu_epoch - pcpu_sched->epoch;
> +	if (n) {
> +		pcpu_sched->epoch += n;
> +		__shr_u64(&pcpu_sched->runtime, n);
> +	}
> +}
> +
> +static unsigned long fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
> +{
> +	guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
> +
> +	__update_mm_sched(rq, pcpu_sched);
> +
> +	/*
> +	 * Runtime is a geometric series (r=0.5) and as such will sum to twice
> +	 * the accumulation period, this means the multiplcation here should
> +	 * not overflow.
> +	 */
> +	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
> +}
> +
> +static inline
> +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> +{
> +	struct mm_struct *mm = p->mm;
> +	struct mm_sched *pcpu_sched;
> +	unsigned long epoch;
> +
> +	/*
> +	 * init_task and kthreads don't be having no mm
> +	 */
> +	if (!mm || !mm->pcpu_sched)
> +		return;
> +
> +	pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
> +
> +	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
> +		__update_mm_sched(rq, pcpu_sched);
> +		pcpu_sched->runtime += delta_exec;
> +		rq->cpu_runtime += delta_exec;
> +		epoch = rq->cpu_epoch;
> +	}
> +
> +	/*
> +	 * If this task hasn't hit task_cache_work() for a while, invalidate
> +	 * it's preferred state.
> +	 */
> +	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
> +		mm->mm_sched_cpu = -1;
> +		pcpu_sched->occ = -1;
> +	}
> +}
> +
> +static void task_tick_cache(struct rq *rq, struct task_struct *p)
> +{
> +	struct callback_head *work = &p->cache_work;
> +	struct mm_struct *mm = p->mm;
> +
> +	if (!mm || !mm->pcpu_sched)
> +		return;
> +
> +	if (mm->mm_sched_epoch == rq->cpu_epoch)
> +		return;
> +
> +	guard(raw_spinlock)(&mm->mm_sched_lock);
> +
> +	if (mm->mm_sched_epoch == rq->cpu_epoch)
> +		return;
> +
> +	if (work->next == work) {
> +		task_work_add(p, work, TWA_RESUME);
> +		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
> +	}
> +}
> +
> +static void task_cache_work(struct callback_head *work)
> +{
> +	struct task_struct *p = current;
> +	struct mm_struct *mm = p->mm;
> +	unsigned long m_a_occ = 0;
> +	int cpu, m_a_cpu = -1;
> +	cpumask_var_t cpus;
> +
> +	WARN_ON_ONCE(work != &p->cache_work);
> +
> +	work->next = work;
> +
> +	if (p->flags & PF_EXITING)
> +		return;
> +
> +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> +		return;
> +
> +	scoped_guard (cpus_read_lock) {
> +		cpumask_copy(cpus, cpu_online_mask);
> +

As pointed out already, this is going to be costly in multi NUMA systems. Any cross NUMA access of
CPUs data is going to add overhead to system bus bandwidth and this happening at tick could be costly.

Also, taking cpu_read_lock does preempt_disable, this could add to large preemptoff?
We need to measure the time it takes on large system. Will try and get back with that number
> +		for_each_cpu(cpu, cpus) {
> +			/* XXX sched_cluster_active */
> +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
> +			unsigned long occ, m_occ = 0, a_occ = 0;
> +			int m_cpu = -1, nr = 0, i;
> +
> +			for_each_cpu(i, sched_domain_span(sd)) {
> +				occ = fraction_mm_sched(cpu_rq(i),
> +							per_cpu_ptr(mm->pcpu_sched, i));
> +				a_occ += occ;
> +				if (occ > m_occ) {
> +					m_occ = occ;
> +					m_cpu = i;
> +				}
> +				nr++;
> +				trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
> +					     per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
> +			}
> +
> +			a_occ /= nr;
> +			if (a_occ > m_a_occ) {
> +				m_a_occ = a_occ;
> +				m_a_cpu = m_cpu;
> +			}
> +
> +			trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
> +				     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
> +
> +			for_each_cpu(i, sched_domain_span(sd)) {
> +				/* XXX threshold ? */
> +				per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
> +			}
> +
> +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
> +		}
> +	}
> +
> +	/*
> +	 * If the max average cache occupancy is 'small' we don't care.
> +	 */
> +	if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
> +		m_a_cpu = -1;
> +
> +	mm->mm_sched_cpu = m_a_cpu;
> +
> +	free_cpumask_var(cpus);
> +}
> +
> +void init_sched_mm(struct task_struct *p)
> +{
> +	struct callback_head *work = &p->cache_work;
> +	init_task_work(work, task_cache_work);
> +	work->next = work;
> +}
> +
> +#else
> +
> +static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
> +				    s64 delta_exec) { }
> +
> +
> +void init_sched_mm(struct task_struct *p) { }
> +
> +static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
> +
> +#endif
> +
> +static inline
> +void update_curr_task(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   {
>   	trace_sched_stat_runtime(p, delta_exec);
>   	account_group_exec_runtime(p, delta_exec);
> +	account_mm_sched(rq, p, delta_exec);
>   	cgroup_account_cputime(p, delta_exec);
>   }
>   

AFAIU, this works and cares only about SCHED_NORMAL.
update_curr_task called by common for RT/DL too. Maybe avoid for those?

> @@ -1215,7 +1434,7 @@ s64 update_curr_common(struct rq *rq)
>   
>   	delta_exec = update_curr_se(rq, &donor->se);
>   	if (likely(delta_exec > 0))
> -		update_curr_task(donor, delta_exec);
> +		update_curr_task(rq, donor, delta_exec);
>   
>   	return delta_exec;
>   }
> @@ -1244,7 +1463,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>   	if (entity_is_task(curr)) {
>   		struct task_struct *p = task_of(curr);
>   
> -		update_curr_task(p, delta_exec);
> +		update_curr_task(rq, p, delta_exec);
>   
>   		/*
>   		 * If the fair_server is active, we need to account for the
> @@ -7848,7 +8067,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   	 * per-cpu select_rq_mask usage
>   	 */
>   	lockdep_assert_irqs_disabled();
> -
> +again:
>   	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
>   	    asym_fits_cpu(task_util, util_min, util_max, target))
>   		return target;
> @@ -7886,7 +8105,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   	/* Check a recently used CPU as a potential idle candidate: */
>   	recent_used_cpu = p->recent_used_cpu;
>   	p->recent_used_cpu = prev;
> -	if (recent_used_cpu != prev &&
> +	if (prev == p->wake_cpu &&
> +	    recent_used_cpu != prev &&
>   	    recent_used_cpu != target &&
>   	    cpus_share_cache(recent_used_cpu, target) &&
>   	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
> @@ -7939,6 +8159,18 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>   	if ((unsigned)i < nr_cpumask_bits)
>   		return i;
>   
> +	if (prev != p->wake_cpu && !cpus_share_cache(prev, p->wake_cpu)) {
> +		/*
> +		 * Most likely select_cache_cpu() will have re-directed
> +		 * the wakeup, but getting here means the preferred cache is
> +		 * too busy, so re-try with the actual previous.
> +		 *
> +		 * XXX wake_affine is lost for this pass.
> +		 */
> +		prev = target = p->wake_cpu;
> +		goto again;
> +	}
> +
>   	/*
>   	 * For cluster machines which have lower sharing cache like L2 or
>   	 * LLC Tag, we tend to find an idle CPU in the target's cluster
> @@ -8561,6 +8793,40 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   	return target;
>   }
>   
> +#ifdef CONFIG_SCHED_CACHE
> +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
> +
> +static int select_cache_cpu(struct task_struct *p, int prev_cpu)
> +{
> +	struct mm_struct *mm = p->mm;
> +	int cpu;
> +
> +	if (!mm || p->nr_cpus_allowed == 1)
> +		return prev_cpu;
> +
> +	cpu = mm->mm_sched_cpu;
> +	if (cpu < 0)
> +		return prev_cpu;
> +
> +
> +	if (static_branch_likely(&sched_numa_balancing) &&
> +	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
> +		/*
> +		 * XXX look for max occupancy inside prev_cpu's node
> +		 */
> +		return prev_cpu;
> +	}
> +
> +	return cpu;
> +}
> +#else
> +static int select_cache_cpu(struct task_struct *p, int prev_cpu)
> +{
> +	return prev_cpu;
> +}
> +#endif
> +
> +
>   /*
>    * select_task_rq_fair: Select target runqueue for the waking task in domains
>    * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -8586,6 +8852,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>   	 * required for stable ->cpus_allowed
>   	 */
>   	lockdep_assert_held(&p->pi_lock);
> +	guard(rcu)();
> +
>   	if (wake_flags & WF_TTWU) {
>   		record_wakee(p);
>   
> @@ -8593,6 +8861,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>   		    cpumask_test_cpu(cpu, p->cpus_ptr))
>   			return cpu;
>   
> +		new_cpu = prev_cpu = select_cache_cpu(p, prev_cpu);
> +
>   		if (!is_rd_overutilized(this_rq()->rd)) {
>   			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>   			if (new_cpu >= 0)
> @@ -8603,7 +8873,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>   		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
>   	}
>   
> -	rcu_read_lock();
>   	for_each_domain(cpu, tmp) {
>   		/*
>   		 * If both 'cpu' and 'prev_cpu' are part of this domain,
> @@ -8636,7 +8905,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>   		/* Fast path */
>   		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
>   	}
> -	rcu_read_unlock();
>   
>   	return new_cpu;
>   }
> @@ -9286,6 +9554,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>   	if (sysctl_sched_migration_cost == 0)
>   		return 0;
>   
> +#ifdef CONFIG_SCHED_CACHE
> +	if (p->mm && p->mm->pcpu_sched) {
> +		/*
> +		 * XXX things like Skylake have non-inclusive L3 and might not
> +		 * like this L3 centric view. What to do about L2 stickyness ?
> +		 */
> +		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
> +		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
> +	}
> +#endif
> +
>   	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
>   
>   	return delta < (s64)sysctl_sched_migration_cost;
> @@ -9297,27 +9576,25 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>    * Returns 0, if task migration is not affected by locality.
>    * Returns a negative value, if task migration improves locality i.e migration preferred.
>    */
> -static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
>   {
>   	struct numa_group *numa_group = rcu_dereference(p->numa_group);
>   	unsigned long src_weight, dst_weight;
>   	int src_nid, dst_nid, dist;
>   
> -	if (!static_branch_likely(&sched_numa_balancing))
> -		return 0;
> -
> -	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
> +	if (!p->numa_faults)
>   		return 0;
>   
> -	src_nid = cpu_to_node(env->src_cpu);
> -	dst_nid = cpu_to_node(env->dst_cpu);
> +	src_nid = cpu_to_node(src_cpu);
> +	dst_nid = cpu_to_node(dst_cpu);
>   
>   	if (src_nid == dst_nid)
>   		return 0;
>   
>   	/* Migrating away from the preferred node is always bad. */
>   	if (src_nid == p->numa_preferred_nid) {
> -		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
> +		struct rq *src_rq = cpu_rq(src_cpu);
> +		if (src_rq->nr_running > src_rq->nr_preferred_running)
>   			return 1;
>   		else
>   			return 0;
> @@ -9328,7 +9605,7 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
>   		return -1;
>   
>   	/* Leaving a core idle is often worse than degrading locality. */
> -	if (env->idle == CPU_IDLE)
> +	if (idle)
>   		return 0;
>   
>   	dist = node_distance(src_nid, dst_nid);
> @@ -9343,7 +9620,24 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
>   	return src_weight - dst_weight;
>   }
>   
> +static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
> +{
> +	if (!static_branch_likely(&sched_numa_balancing))
> +		return 0;
> +
> +	if (!(env->sd->flags & SD_NUMA))
> +		return 0;
> +
> +	return __migrate_degrades_locality(p, env->src_cpu, env->dst_cpu,
> +					   env->idle == CPU_IDLE);
> +}
> +
>   #else
> +static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
> +{
> +	return 0;
> +}
> +
>   static inline long migrate_degrades_locality(struct task_struct *p,
>   					     struct lb_env *env)
>   {
> @@ -13102,8 +13396,8 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
>    */
>   static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>   {
> -	struct cfs_rq *cfs_rq;
>   	struct sched_entity *se = &curr->se;
> +	struct cfs_rq *cfs_rq;
>   
>   	for_each_sched_entity(se) {
>   		cfs_rq = cfs_rq_of(se);
> @@ -13113,6 +13407,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>   	if (static_branch_unlikely(&sched_numa_balancing))
>   		task_tick_numa(rq, curr);
>   
> +	task_tick_cache(rq, curr);
> +
>   	update_misfit_status(curr, rq);
>   	check_update_overutilized_status(task_rq(curr));
>   
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 47972f34ea70..d16ccd66ca07 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1171,6 +1171,12 @@ struct rq {
>   	u64			clock_pelt_idle_copy;
>   	u64			clock_idle_copy;
>   #endif
> +#ifdef CONFIG_SCHED_CACHE
> +	raw_spinlock_t		cpu_epoch_lock;
> +	u64			cpu_runtime;
> +	unsigned long		cpu_epoch;
> +	unsigned long		cpu_epoch_next;
> +#endif
>   

Maybe these can go to their own cacheline?

>   	atomic_t		nr_iowait;
>   
> @@ -3861,6 +3867,8 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
>   static inline void init_sched_mm_cid(struct task_struct *t) { }
>   #endif /* !CONFIG_SCHED_MM_CID */
>   
> +extern void init_sched_mm(struct task_struct *p);
> +
>   extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
>   extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
>   #ifdef CONFIG_SMP


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling
  2025-06-18 18:27 ` [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling Tim Chen
@ 2025-07-03 19:33   ` Shrikanth Hegde
  2025-07-07 21:02     ` Tim Chen
  2025-07-08  1:15   ` Libo Chen
  1 sibling, 1 reply; 68+ messages in thread
From: Shrikanth Hegde @ 2025-07-03 19:33 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel



On 6/18/25 23:57, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
> 
> 1. Fix compile error on percpu allocation.
> 2. Enqueue to the target CPU rather than the current CPU.
> 3. NULL LLC sched domain check(Libo Chen).
> 4. Introduce sched feature SCHED_CACHE to control cache aware scheduling
> 5. Fix unsigned occupancy initialization to -1.
> 6. If there is only 1 thread in the process, no need to enable cache
>     awareness
> 7. Add __maybe_unused to __migrate_degrades_locality() to
>     avoid compile warnings.
> 
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>   include/linux/mm_types.h |  4 ++--
>   kernel/sched/fair.c      | 27 ++++++++++++++++-----------
>   kernel/sched/features.h  |  1 +
>   3 files changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 013291c6aaa2..9de4a0a13c4d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1411,11 +1411,11 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
>   #endif /* CONFIG_SCHED_MM_CID */
>   
>   #ifdef CONFIG_SCHED_CACHE
> -extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sched);
> +extern void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
>   
>   static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
>   {
> -	struct mm_sched *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
> +	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
>   	if (!pcpu_sched)
>   		return -ENOMEM;
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df7d4a324fbe..89db97f8ef02 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1175,7 +1175,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
>   #define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
>   #define EPOCH_OLD	5		/* 50 ms */
>   
> -void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
> +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
>   {
>   	unsigned long epoch;
>   	int i;
> @@ -1186,7 +1186,7 @@ void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
>   
>   		pcpu_sched->runtime = 0;
>   		pcpu_sched->epoch = epoch = rq->cpu_epoch;
> -		pcpu_sched->occ = -1;
> +		pcpu_sched->occ = 0;
>   	}
>   
>   	raw_spin_lock_init(&mm->mm_sched_lock);
> @@ -1254,7 +1254,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   	if (!mm || !mm->pcpu_sched)
>   		return;
>   
> -	pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
> +	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
>   
>   	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
>   		__update_mm_sched(rq, pcpu_sched);
> @@ -1264,12 +1264,14 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   	}
>   
>   	/*
> -	 * If this task hasn't hit task_cache_work() for a while, invalidate
> +	 * If this task hasn't hit task_cache_work() for a while, or it
> +	 * has only 1 thread, invalidate
>   	 * it's preferred state.
>   	 */
> -	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
> +	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD ||
> +	    get_nr_threads(p) <= 1) {
>   		mm->mm_sched_cpu = -1;
> -		pcpu_sched->occ = -1;
> +		pcpu_sched->occ = 0;
>   	}
>   }
>   
> @@ -1286,9 +1288,6 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
>   
>   	guard(raw_spinlock)(&mm->mm_sched_lock);
>   
> -	if (mm->mm_sched_epoch == rq->cpu_epoch)
> -		return;
> -
>   	if (work->next == work) {
>   		task_work_add(p, work, TWA_RESUME);
>   		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
> @@ -1322,6 +1321,9 @@ static void task_cache_work(struct callback_head *work)
>   			unsigned long occ, m_occ = 0, a_occ = 0;
>   			int m_cpu = -1, nr = 0, i;
>   
> +			if (!sd)
> +				continue;
> +
>   			for_each_cpu(i, sched_domain_span(sd)) {
>   				occ = fraction_mm_sched(cpu_rq(i),
>   							per_cpu_ptr(mm->pcpu_sched, i));
> @@ -8801,6 +8803,9 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
>   	struct mm_struct *mm = p->mm;
>   	int cpu;
>   
> +	if (!sched_feat(SCHED_CACHE))
> +		return prev_cpu;
> +
>   	if (!mm || p->nr_cpus_allowed == 1)
>   		return prev_cpu;
>   
> @@ -9555,7 +9560,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>   		return 0;
>   
>   #ifdef CONFIG_SCHED_CACHE
> -	if (p->mm && p->mm->pcpu_sched) {
> +	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
>   		/*
>   		 * XXX things like Skylake have non-inclusive L3 and might not
>   		 * like this L3 centric view. What to do about L2 stickyness ?
> @@ -9633,7 +9638,7 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
>   }
>   
>   #else
> -static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
> +static __maybe_unused long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
>   {
>   	return 0;
>   }
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 3c12d9f93331..d2af7bfd36bf 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -87,6 +87,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>    */
>   SCHED_FEAT(SIS_UTIL, true)
>   
> +SCHED_FEAT(SCHED_CACHE, true)

Having both SCHED_FEAT and CONFIG_SCHED_CACHE seems like overkill.
Is it really necessary to have both?

Also, given the complexity it brings and only a workloads which spawns threads
which have data sharing among them benefit, it could be false by default.

>   /*
>    * Issue a WARN when we do multiple update_rq_clock() calls
>    * in a single rq->lock section. Default disabled because the


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 04/20] sched: Avoid calculating the cpumask if the system is overloaded
  2025-06-18 18:27 ` [RFC patch v3 04/20] sched: Avoid calculating the cpumask if the system is overloaded Tim Chen
@ 2025-07-03 19:39   ` Shrikanth Hegde
  2025-07-07 14:57     ` Tim Chen
  0 siblings, 1 reply; 68+ messages in thread
From: Shrikanth Hegde @ 2025-07-03 19:39 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu, Gautham R . Shenoy



On 6/18/25 23:57, Tim Chen wrote:
> From: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> If the SIS_UTIL cuts off idle cpu search, result of the cpumask_and() is
> of no use. Since select_idle_cpu() can now be called twice per wake up
> in the select_idle_sibling() due to cache aware wake up, this overhead
> can be visible in benchmarks like hackbench.
> 
> To save some additional cycles, especially in cases where we target
> the LLC frequently and the search bails out because the LLC is busy,
> only calculate the cpumask if the system is not overloaded.
> 

This patch could be independent and should help in general.
But changelog needs to be updated.


> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>   kernel/sched/fair.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 567ad2a0cfa2..6a2678f9d44a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7918,8 +7918,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>   	int i, cpu, idle_cpu = -1, nr = INT_MAX;
>   	struct sched_domain_shared *sd_share;
>   
> -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> -
>   	if (sched_feat(SIS_UTIL)) {
>   		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
>   		if (sd_share) {
> @@ -7931,6 +7929,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>   		}
>   	}
>   
> +	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
>   	if (static_branch_unlikely(&sched_cluster_active)) {
>   		struct sched_group *sg = sd->groups;
>   


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 08/20] sched: Set up LLC indexing
  2025-06-18 18:27 ` [RFC patch v3 08/20] sched: Set up LLC indexing Tim Chen
@ 2025-07-03 19:44   ` Shrikanth Hegde
  2025-07-04  9:36     ` Chen, Yu C
  0 siblings, 1 reply; 68+ messages in thread
From: Shrikanth Hegde @ 2025-07-03 19:44 UTC (permalink / raw)
  To: Tim Chen, Chen Yu
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy



On 6/18/25 23:57, Tim Chen wrote:
> Prepare for indexing arrays that track in each run queue: the number
> of tasks preferring current LLC and each of the other LLC.
> 
> The reason to introduce LLC index is because the per LLC-scope data
> is needed to do cache aware load balancing. However, the native lld_id
> is usually the first CPU of that LLC domain, which is not continuous,
> which might waste the space if the per LLC-scope data is stored
> in an array (in current implementation).
> 
> In the future, this LLC index could be removed after
> the native llc_id is used as the key to search into xarray based
> array.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>   include/linux/sched.h   |  3 +++
>   kernel/sched/fair.c     | 12 ++++++++++++
>   kernel/sched/sched.h    |  2 ++
>   kernel/sched/topology.c | 29 +++++++++++++++++++++++++++++
>   4 files changed, 46 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d0e4cda2b3cd..7ce95a32e9ff 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -810,6 +810,9 @@ struct kmap_ctrl {
>   #endif
>   };
>   
> +/* XXX need fix to not use magic number */
> +#define MAX_LLC 64

This number needs to be much higher. maybe keeping NR_CPUS wont hurt.

> +
>   struct task_struct {
>   #ifdef CONFIG_THREAD_INFO_IN_TASK
>   	/*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 10ea408d0e40..5549710d95cf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1183,6 +1183,18 @@ static int llc_id(int cpu)
>   	return per_cpu(sd_llc_id, cpu);
>   }
>   
> +/*
> + * continous index.
> + * TBD: replace by xarray with key llc_id()
> + */
> +static inline int llc_idx(int cpu)
> +{
> +	if (cpu < 0)
> +		return -1;
> +
> +	return per_cpu(sd_llc_idx, cpu);
> +}
> +
>   void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
>   {
>   	unsigned long epoch;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1c6fd45c7f62..74eb2f3615aa 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2037,6 +2037,7 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>   DECLARE_PER_CPU(int, sd_llc_size);
>   DECLARE_PER_CPU(int, sd_llc_id);
> +DECLARE_PER_CPU(int, sd_llc_idx);
>   DECLARE_PER_CPU(int, sd_share_id);
>   DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> @@ -2045,6 +2046,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>   
>   extern struct static_key_false sched_asym_cpucapacity;
>   extern struct static_key_false sched_cluster_active;
> +extern int max_llcs;
>   
>   static __always_inline bool sched_asym_cpucap_active(void)
>   {
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index f1ebc60d967f..b7bb13045dd8 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -672,6 +672,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>   DEFINE_PER_CPU(int, sd_llc_size);
>   DEFINE_PER_CPU(int, sd_llc_id);
> +DEFINE_PER_CPU(int, sd_llc_idx);
>   DEFINE_PER_CPU(int, sd_share_id);
>   DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> @@ -681,6 +682,25 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>   DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
>   DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
>   
> +int max_llcs = -1;
> +
> +static void update_llc_idx(int cpu)
> +{
> +#ifdef CONFIG_SCHED_CACHE
> +	int idx = -1, llc_id = -1;
> +
> +	llc_id = per_cpu(sd_llc_id, cpu);
> +	idx = per_cpu(sd_llc_idx, llc_id);
> +
> +	if (idx < 0) {
> +		idx = max_llcs++;
> +		BUG_ON(idx > MAX_LLC); 

maybe a warning instead here?

> +		per_cpu(sd_llc_idx, llc_id) = idx;
> +	}
> +	per_cpu(sd_llc_idx, cpu) = idx;
> +#endif
> +}
> +
>   static void update_top_cache_domain(int cpu)
>   {
>   	struct sched_domain_shared *sds = NULL;
> @@ -699,6 +719,7 @@ static void update_top_cache_domain(int cpu)
>   	per_cpu(sd_llc_size, cpu) = size;
>   	per_cpu(sd_llc_id, cpu) = id;
>   	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> +	update_llc_idx(cpu);
>   
>   	sd = lowest_flag_domain(cpu, SD_CLUSTER);
>   	if (sd)
> @@ -2394,6 +2415,14 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>   	bool has_asym = false;
>   	bool has_cluster = false;
>   
> +#ifdef CONFIG_SCHED_CACHE
> +	if (max_llcs < 0) {
> +		for_each_possible_cpu(i)
> +			per_cpu(sd_llc_idx, i) = -1;
> +		max_llcs = 0;
> +	}
> +#endif
> +
>   	if (WARN_ON(cpumask_empty(cpu_map)))
>   		goto error;
>   


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 10/20] sched: Calculate the number of tasks that have LLC preference on a runqueue
  2025-06-18 18:27 ` [RFC patch v3 10/20] sched: Calculate the number of tasks that have LLC preference on a runqueue Tim Chen
@ 2025-07-03 19:45   ` Shrikanth Hegde
  2025-07-04 15:00     ` Chen, Yu C
  0 siblings, 1 reply; 68+ messages in thread
From: Shrikanth Hegde @ 2025-07-03 19:45 UTC (permalink / raw)
  To: Tim Chen, Chen Yu
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy



On 6/18/25 23:57, Tim Chen wrote:
> Track for each run queue, the number of tasks that have a LLC preference
> and how many of those tasks are running in its preferred LLC.  This is
> similar to nr_numa_running and nr_preferred_running for NUMA balance,
> and will be used by the cache-aware load balancing in subsequent patches.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>   kernel/sched/core.c  | 12 ++++++++++++
>   kernel/sched/fair.c  | 42 +++++++++++++++++++++++++++++++++++++++++-
>   kernel/sched/sched.h |  7 +++++++
>   3 files changed, 60 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d9c3e75f79d1..34056eb79ef2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -498,6 +498,18 @@ void __trace_set_current_state(int state_value)
>   }
>   EXPORT_SYMBOL(__trace_set_current_state);
>   
> +#ifdef CONFIG_SMP


CONFIG_SMP is true unconditionally now. Else may need to go.

> +int task_llc(const struct task_struct *p)
> +{
> +	return per_cpu(sd_llc_id, task_cpu(p));
> +}
> +#else
> +int task_llc(const struct task_struct *p)
> +{
> +	return 0;
> +}
> +#endif
> +
>   /*
>    * Serialization rules:
>    *
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cc804a8c7061..88ff47194faa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1195,6 +1195,18 @@ static inline int llc_idx(int cpu)
>   	return per_cpu(sd_llc_idx, cpu);
>   }
>   
> +static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
> +{
> +	rq->nr_llc_running += (p->preferred_llc != -1);
> +	rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));
> +}
> +
> +static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
> +{
> +	rq->nr_llc_running -= (p->preferred_llc != -1);
> +	rq->nr_pref_llc_running -= (p->preferred_llc == task_llc(p));
> +}
> +
>   void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
>   {
>   	unsigned long epoch;
> @@ -1298,8 +1310,11 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   	if (mm->mm_sched_cpu != -1)
>   		mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);
>   
> -	if (p->preferred_llc != mm_sched_llc)
> +	if (p->preferred_llc != mm_sched_llc) {
> +		account_llc_dequeue(rq, p);
>   		p->preferred_llc = mm_sched_llc;
> +		account_llc_enqueue(rq, p);
> +	}
>   }
>   
>   static void task_tick_cache(struct rq *rq, struct task_struct *p)
> @@ -1400,6 +1415,14 @@ void init_sched_mm(struct task_struct *p)
>   	work->next = work;
>   }
>   
> +void reset_llc_stats(struct rq *rq)
> +{
> +	if (rq->nr_llc_running)
> +		rq->nr_llc_running = 0;
> +
> +	rq->nr_pref_llc_running = 0;
> +}
> +
>   #else
>   
>   static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
> @@ -1410,6 +1433,17 @@ void init_sched_mm(struct task_struct *p) { }
>   
>   static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
>   
> +static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
> +{
> +}
> +
> +static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
> +{
> +}
> +
> +void reset_llc_stats(struct rq *rq)
> +{
> +}
>   #endif
>   
>   static inline
> @@ -3939,6 +3973,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   		struct rq *rq = rq_of(cfs_rq);
>   
>   		account_numa_enqueue(rq, task_of(se));
> +		account_llc_enqueue(rq, task_of(se));
>   		list_add(&se->group_node, &rq->cfs_tasks);
>   	}
>   #endif
> @@ -3952,10 +3987,15 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   #ifdef CONFIG_SMP
>   	if (entity_is_task(se)) {
>   		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> +		account_llc_dequeue(rq_of(cfs_rq), task_of(se));
>   		list_del_init(&se->group_node);
>   	}
>   #endif
>   	cfs_rq->nr_queued--;
> +
> +	/* safeguard? */
> +	if (!parent_entity(se) && !cfs_rq->nr_queued)
> +		reset_llc_stats(rq_of(cfs_rq));
>   }
>   
>   /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 74eb2f3615aa..6c83a71ac8ca 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1104,6 +1104,10 @@ struct rq {
>   	unsigned int		nr_preferred_running;
>   	unsigned int		numa_migrate_on;
>   #endif
> +#ifdef CONFIG_SCHED_CACHE
> +	unsigned int		nr_pref_llc_running;
> +	unsigned int		nr_llc_running;
> +#endif
>   #ifdef CONFIG_NO_HZ_COMMON
>   #ifdef CONFIG_SMP
>   	unsigned long		last_blocked_load_update_tick;
> @@ -1948,6 +1952,9 @@ init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>   
>   #endif /* !CONFIG_NUMA_BALANCING */
>   
> +extern void reset_llc_stats(struct rq *rq);
> +extern int task_llc(const struct task_struct *p);
> +
>   #ifdef CONFIG_SMP
>   
>   static inline void


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 14/20] sched: Introduce update_llc_busiest() to deal with groups having preferred LLC tasks
  2025-06-18 18:28 ` [RFC patch v3 14/20] sched: Introduce update_llc_busiest() to deal with groups having preferred LLC tasks Tim Chen
@ 2025-07-03 19:52   ` Shrikanth Hegde
  2025-07-05  2:26     ` Chen, Yu C
  0 siblings, 1 reply; 68+ messages in thread
From: Shrikanth Hegde @ 2025-07-03 19:52 UTC (permalink / raw)
  To: Tim Chen, Chen Yu
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy



On 6/18/25 23:58, Tim Chen wrote:
> The load balancer attempts to identify the busiest sched_group with
> the highest load and migrates some tasks to a less busy sched_group
> to distribute the load across different CPUs.
> 
> When cache-aware scheduling is enabled, the busiest sched_group is
> defined as the one with the highest number of tasks preferring to run
> on the destination LLC. If the busiest group has llc_balance tag,
> the cache aware load balance will be launched.
> 
> Introduce the helper function update_llc_busiest() to identify
> such sched group with most tasks preferring the destination LLC.
> 
> Co-developed-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>   kernel/sched/fair.c | 36 +++++++++++++++++++++++++++++++++++-
>   1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48a090c6e885..ab3d1239d6e4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10848,12 +10848,36 @@ static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
>   
>   	return false;
>   }
> +
> +static bool update_llc_busiest(struct lb_env *env,
> +			       struct sg_lb_stats *busiest,
> +			       struct sg_lb_stats *sgs)
> +{
> +	int idx;
> +
> +	/* Only the candidate with llc_balance need to be taken care of */
> +	if (!sgs->group_llc_balance)
> +		return false;
> +
> +	/*
> +	 * There are more tasks that want to run on dst_cpu's LLC.
> +	 */
> +	idx = llc_idx(env->dst_cpu);
> +	return sgs->nr_pref_llc[idx] > busiest->nr_pref_llc[idx];
> +}
>   #else
>   static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
>   			       struct sched_group *group)
>   {
>   	return false;
>   }
> +
> +static bool update_llc_busiest(struct lb_env *env,
> +			       struct sg_lb_stats *busiest,
> +			       struct sg_lb_stats *sgs)
> +{
> +	return false;
> +}
>   #endif
>   
>   static inline long sibling_imbalance(struct lb_env *env,
> @@ -11085,6 +11109,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>   	     sds->local_stat.group_type != group_has_spare))
>   		return false;
>   
> +	/* deal with prefer LLC load balance, if failed, fall into normal load balance */
> +	if (update_llc_busiest(env, busiest, sgs))
> +		return true;
> +
> +	/* if there is already a busy group, skip the normal load balance */
> +	if (busiest->group_llc_balance)
> +		return false;
> +

If you had a group which was group_overloaded but it could have group_llc_balance right?
In this case the priorities based on group_type is not followed no?

>   	if (sgs->group_type > busiest->group_type)
>   		return true;
>   
> @@ -11991,9 +12023,11 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
>   	/*
>   	 * Try to move all excess tasks to a sibling domain of the busiest
>   	 * group's child domain.
> +	 * Also do so if we can move some tasks that prefer the local LLC.
>   	 */
>   	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -	    sibling_imbalance(env, &sds, busiest, local) > 1)
> +	    (busiest->group_llc_balance ||
> +	    sibling_imbalance(env, &sds, busiest, local) > 1))
>   		goto force_balance;
>   
>   	if (busiest->group_type != group_overloaded) {

Also, This load balancing happening due to llc could be very tricky to debug.
Any stats added to schedstat or sched/debug?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-24  5:00 ` K Prateek Nayak
  2025-06-24 12:16   ` Chen, Yu C
  2025-06-25  0:30   ` Tim Chen
@ 2025-07-03 20:00   ` Shrikanth Hegde
  2025-07-04 10:09     ` Chen, Yu C
  2 siblings, 1 reply; 68+ messages in thread
From: Shrikanth Hegde @ 2025-07-03 20:00 UTC (permalink / raw)
  To: K Prateek Nayak, Tim Chen, Chen Yu
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Vincent Guittot, Libo Chen,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy


> 
> tl;dr
> 
> o Benchmark that prefer co-location and run in threaded mode see
>    a benefit including hackbench at high utilization and schbench
>    at low utilization.
> 
> o schbench (both new and old but particularly the old) regresses
>    quite a bit on the tial latency metric when #workers cross the
>    LLC size.
> 
> o client-server benchmarks where client and servers are threads
>    from different processes (netserver-netperf, tbench_srv-tbench,
>    services of DeathStarBench) seem to noticeably regress due to
>    lack of co-location between the communicating client and server.
> 
>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>    the preferred LLC hint.
> 
> o stream regresses in some runs where the occupancy metrics trip
>    and assign a preferred LLC for all the stream threads bringing
>    down performance in !50% of the runs.
> 

- When you have SMT systems, threads will go faster if they run in ST mode.
If aggregation happens in a LLC, they might end up with lower IPC.

> Full data from my testing is as follows:
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> o Kernel details
> 
> tip:      tip:sched/core at commit 914873bc7df9 ("Merge tag
>             'x86-build-2025-05-25' of
>             git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> 
> llc-aware-lb-v3: tip + this series as is
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-07-03 19:29   ` Shrikanth Hegde
@ 2025-07-04  8:40     ` Chen, Yu C
  2025-07-04  8:45       ` Peter Zijlstra
  2025-07-07 19:57     ` Tim Chen
  1 sibling, 1 reply; 68+ messages in thread
From: Chen, Yu C @ 2025-07-04  8:40 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Tim Chen

Hi Shrikanth,

On 7/4/2025 3:29 AM, Shrikanth Hegde wrote:
> 
> 
> Hi Tim, Chen,
> skimming through the series and will try to go through in coming days.
> 

Thanks for your interest in this change.

>>
>> One of the many things on the eternal todo list has been finishing the
>> below hackery.
>>
>> It is an attempt at modelling cache affinity -- and while the patch
>> really only targets LLC, it could very well be extended to also apply to
>> clusters (L2). Specifically any case of multiple cache domains inside a
>> node.
>>
>> Anyway, I wrote this about a year ago, and I mentioned this at the
>> recent OSPM conf where Gautham and Prateek expressed interest in playing
>> with this code.
>>
>> So here goes, very rough and largely unproven code ahead :-)
>>
>> It applies to current tip/master, but I know it will fail the __percpu
>> validation that sits in -next, although that shouldn't be terribly hard
>> to fix up.
>>
>> As is, it only computes a CPU inside the LLC that has the highest recent
>> runtime, this CPU is then used in the wake-up path to steer towards this
>> LLC and in task_hot() to limit migrations away from it.
>>
>> More elaborate things could be done, notably there is an XXX in there
>> somewhere about finding the best LLC inside a NODE (interaction with
>> NUMA_BALANCING).
>>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> ---

[snip...]

>> +
>>   #ifdef CONFIG_RSEQ
>>       struct rseq __user *rseq;
>>       u32 rseq_len;
>> diff --git a/init/Kconfig b/init/Kconfig
>> index bf3a920064be..e2509127b6f9 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -953,6 +953,10 @@ config NUMA_BALANCING
>>         This system will be inactive on UMA systems.
>> +config SCHED_CACHE
>> +    bool "Cache aware scheduler"
>> +    default y
>> +
> 
> Should it depend on EXPERT?
> IMO this could add quite a bit of overhead and maybe n by default?
> 

I would leave this to Peter and Tim to decide.

>>   config NUMA_BALANCING_DEFAULT_ENABLED
>>       bool "Automatically enable NUMA aware memory/task placement"
>>       default y
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 168681fc4b25..da1387823b9e 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c

[snip]

>> +#ifdef CONFIG_SCHED_CACHE
>> +
>> +/*
>> + * XXX numbers come from a place the sun don't shine -- probably 
>> wants to be SD
>> + * tunable or so.
>> + */
>> +#define EPOCH_PERIOD    (HZ/100)    /* 10 ms */
>> +#define EPOCH_OLD    5        /* 50 ms */
> 
> Have these been converted into tunables? I didn't spot that in the series.
> 

OK, they could be added into debugfs.


>> +
>> +static inline void __update_mm_sched(struct rq *rq, struct mm_sched 
>> *pcpu_sched)
>> +{
>> +    lockdep_assert_held(&rq->cpu_epoch_lock);
>> +
>> +    unsigned long n, now = jiffies;
>> +    long delta = now - rq->cpu_epoch_next;
>> +
>> +    if (delta > 0) {
>> +        n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
>> +        rq->cpu_epoch += n;
>> +        rq->cpu_epoch_next += n * EPOCH_PERIOD;
>> +        __shr_u64(&rq->cpu_runtime, n);
> 
> Another doubt i had, does this occupancy works when there is CPU 
> bandwidth controller running?
> A 50% occupancy may have different meaning when CPU bandwidth is set to 
> 50%?
> 

Even if cgroup throttle is enabled, the 50% might still indicate that
the occupancy on that CPU is real, and probably less "cache-hot".

>> +    }
>> +
>> +    n = rq->cpu_epoch - pcpu_sched->epoch;
>> +    if (n) {
>> +        pcpu_sched->epoch += n;
>> +        __shr_u64(&pcpu_sched->runtime, n);
>> +    }
>> +}
>> +
>> +static unsigned long fraction_mm_sched(struct rq *rq, struct mm_sched 
>> *pcpu_sched)
>> +{
>> +    guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
>> +
>> +    __update_mm_sched(rq, pcpu_sched);
>> +
>> +    /*
>> +     * Runtime is a geometric series (r=0.5) and as such will sum to 
>> twice
>> +     * the accumulation period, this means the multiplcation here should
>> +     * not overflow.
>> +     */
>> +    return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq- 
>> >cpu_runtime + 1);
>> +}
>> +
>> +static inline
>> +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 
>> delta_exec)
>> +{
>> +    struct mm_struct *mm = p->mm;
>> +    struct mm_sched *pcpu_sched;
>> +    unsigned long epoch;
>> +
>> +    /*
>> +     * init_task and kthreads don't be having no mm
>> +     */
>> +    if (!mm || !mm->pcpu_sched)
>> +        return;
>> +
>> +    pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
>> +
>> +    scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
>> +        __update_mm_sched(rq, pcpu_sched);
>> +        pcpu_sched->runtime += delta_exec;
>> +        rq->cpu_runtime += delta_exec;
>> +        epoch = rq->cpu_epoch;
>> +    }
>> +
>> +    /*
>> +     * If this task hasn't hit task_cache_work() for a while, invalidate
>> +     * it's preferred state.
>> +     */
>> +    if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
>> +        mm->mm_sched_cpu = -1;
>> +        pcpu_sched->occ = -1;
>> +    }
>> +}
>> +
>> +static void task_tick_cache(struct rq *rq, struct task_struct *p)
>> +{
>> +    struct callback_head *work = &p->cache_work;
>> +    struct mm_struct *mm = p->mm;
>> +
>> +    if (!mm || !mm->pcpu_sched)
>> +        return;
>> +
>> +    if (mm->mm_sched_epoch == rq->cpu_epoch)
>> +        return;
>> +
>> +    guard(raw_spinlock)(&mm->mm_sched_lock);
>> +
>> +    if (mm->mm_sched_epoch == rq->cpu_epoch)
>> +        return;
>> +
>> +    if (work->next == work) {
>> +        task_work_add(p, work, TWA_RESUME);
>> +        WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
>> +    }
>> +}
>> +
>> +static void task_cache_work(struct callback_head *work)
>> +{
>> +    struct task_struct *p = current;
>> +    struct mm_struct *mm = p->mm;
>> +    unsigned long m_a_occ = 0;
>> +    int cpu, m_a_cpu = -1;
>> +    cpumask_var_t cpus;
>> +
>> +    WARN_ON_ONCE(work != &p->cache_work);
>> +
>> +    work->next = work;
>> +
>> +    if (p->flags & PF_EXITING)
>> +        return;
>> +
>> +    if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
>> +        return;
>> +
>> +    scoped_guard (cpus_read_lock) {
>> +        cpumask_copy(cpus, cpu_online_mask);
>> +
> 
> As pointed out already, this is going to be costly in multi NUMA 
> systems. Any cross NUMA access of
> CPUs data is going to add overhead to system bus bandwidth and this 
> happening at tick could be costly.
> 

Yes, we are trying to reduce the overhead of CPU scan, although this
scan happens not at every tick.

> Also, taking cpu_read_lock does preempt_disable, this could add to large 
> preemptoff?

cpus_read_lock() just disables the preemption for a short time I
suppose? If it can not get the lock, it enable the preemption and goes
to sleep.

> We need to measure the time it takes on large system. Will try and get 
> back with that number

OK, looking forward to it.


>> +        for_each_cpu(cpu, cpus) {
>> +            /* XXX sched_cluster_active */
>> +            struct sched_domain *sd = per_cpu(sd_llc, cpu);
>> +            unsigned long occ, m_occ = 0, a_occ = 0;
>> +            int m_cpu = -1, nr = 0, i;
>> +
>> +            for_each_cpu(i, sched_domain_span(sd)) {
>> +                occ = fraction_mm_sched(cpu_rq(i),
>> +                            per_cpu_ptr(mm->pcpu_sched, i));
>> +                a_occ += occ;
>> +                if (occ > m_occ) {
>> +                    m_occ = occ;
>> +                    m_cpu = i;
>> +                }
>> +                nr++;
>> +                trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: 
>> %d\n",
>> +                         per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
>> +            }
>> +
>> +            a_occ /= nr;
>> +            if (a_occ > m_a_occ) {
>> +                m_a_occ = a_occ;
>> +                m_a_cpu = m_cpu;
>> +            }
>> +
>> +            trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
>> +                     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
>> +
>> +            for_each_cpu(i, sched_domain_span(sd)) {
>> +                /* XXX threshold ? */
>> +                per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
>> +            }
>> +
>> +            cpumask_andnot(cpus, cpus, sched_domain_span(sd));
>> +        }
>> +    }
>> +
>> +    /*
>> +     * If the max average cache occupancy is 'small' we don't care.
>> +     */
>> +    if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
>> +        m_a_cpu = -1;
>> +
>> +    mm->mm_sched_cpu = m_a_cpu;
>> +
>> +    free_cpumask_var(cpus);
>> +}
>> +
>> +void init_sched_mm(struct task_struct *p)
>> +{
>> +    struct callback_head *work = &p->cache_work;
>> +    init_task_work(work, task_cache_work);
>> +    work->next = work;
>> +}
>> +
>> +#else
>> +
>> +static inline void account_mm_sched(struct rq *rq, struct task_struct 
>> *p,
>> +                    s64 delta_exec) { }
>> +
>> +
>> +void init_sched_mm(struct task_struct *p) { }
>> +
>> +static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
>> +
>> +#endif
>> +
>> +static inline
>> +void update_curr_task(struct rq *rq, struct task_struct *p, s64 
>> delta_exec)
>>   {
>>       trace_sched_stat_runtime(p, delta_exec);
>>       account_group_exec_runtime(p, delta_exec);
>> +    account_mm_sched(rq, p, delta_exec);
>>       cgroup_account_cputime(p, delta_exec);
>>   }
> 
> AFAIU, this works and cares only about SCHED_NORMAL.
> update_curr_task called by common for RT/DL too. Maybe avoid for those?
> 

OK, will fix it.

>> @@ -1215,7 +1434,7 @@ s64 update_curr_common(struct rq *rq)
>>       delta_exec = update_curr_se(rq, &donor->se);
>>       if (likely(delta_exec > 0))
>> -        update_curr_task(donor, delta_exec);
>> +        update_curr_task(rq, donor, delta_exec);
>>       return delta_exec;
>>   }

[snip...]

>>       check_update_overutilized_status(task_rq(curr));
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 47972f34ea70..d16ccd66ca07 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1171,6 +1171,12 @@ struct rq {
>>       u64            clock_pelt_idle_copy;
>>       u64            clock_idle_copy;
>>   #endif
>> +#ifdef CONFIG_SCHED_CACHE
>> +    raw_spinlock_t        cpu_epoch_lock;
>> +    u64            cpu_runtime;
>> +    unsigned long        cpu_epoch;
>> +    unsigned long        cpu_epoch_next;
>> +#endif
> 
> Maybe these can go to their own cacheline?
> 

Sure. Do you mean there is risk to cause false
sharing, that theres *_epoch could race with
either atomic_t nr_iowait or u64 clock_idle_copy?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-07-04  8:40     ` Chen, Yu C
@ 2025-07-04  8:45       ` Peter Zijlstra
  2025-07-04  8:54         ` Shrikanth Hegde
  0 siblings, 1 reply; 68+ messages in thread
From: Peter Zijlstra @ 2025-07-04  8:45 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Shrikanth Hegde, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Madadi Vineeth Reddy,
	Hillf Danton, Len Brown, linux-kernel, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy, Tim Chen

On Fri, Jul 04, 2025 at 04:40:39PM +0800, Chen, Yu C wrote:

> > > @@ -953,6 +953,10 @@ config NUMA_BALANCING
> > > ?????????????? This system will be inactive on UMA systems.
> > > +config SCHED_CACHE
> > > +?????? bool "Cache aware scheduler"
> > > +?????? default y
> > > +
> > 
> > Should it depend on EXPERT?
> > IMO this could add quite a bit of overhead and maybe n by default?
> > 
> 
> I would leave this to Peter and Tim to decide.

Runtime controls are always better than compile time. Distros will have
no choice but to enable the config option.

But that is not the kind of thing you start a series with. First
versions didn't even have the config option. First you make it work,
then later you worry about silly detail.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-07-04  8:45       ` Peter Zijlstra
@ 2025-07-04  8:54         ` Shrikanth Hegde
  0 siblings, 0 replies; 68+ messages in thread
From: Shrikanth Hegde @ 2025-07-04  8:54 UTC (permalink / raw)
  To: Peter Zijlstra, Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Tim Chen



On 7/4/25 14:15, Peter Zijlstra wrote:
> On Fri, Jul 04, 2025 at 04:40:39PM +0800, Chen, Yu C wrote:
> 
>>>> @@ -953,6 +953,10 @@ config NUMA_BALANCING
>>>> ?????????????? This system will be inactive on UMA systems.
>>>> +config SCHED_CACHE
>>>> +?????? bool "Cache aware scheduler"
>>>> +?????? default y
>>>> +
>>>
>>> Should it depend on EXPERT?
>>> IMO this could add quite a bit of overhead and maybe n by default?
>>>
>>
>> I would leave this to Peter and Tim to decide.
> 
> Runtime controls are always better than compile time. Distros will have
> no choice but to enable the config option.
> 
> But that is not the kind of thing you start a series with. First
> versions didn't even have the config option. First you make it work,
> then later you worry about silly detail.
> 

Ok Makes sense.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 08/20] sched: Set up LLC indexing
  2025-07-03 19:44   ` Shrikanth Hegde
@ 2025-07-04  9:36     ` Chen, Yu C
  0 siblings, 0 replies; 68+ messages in thread
From: Chen, Yu C @ 2025-07-04  9:36 UTC (permalink / raw)
  To: Shrikanth Hegde, Tim Chen
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy

On 7/4/2025 3:44 AM, Shrikanth Hegde wrote:
> 
> 
> On 6/18/25 23:57, Tim Chen wrote:
>> Prepare for indexing arrays that track in each run queue: the number
>> of tasks preferring current LLC and each of the other LLC.
>>
>> The reason to introduce LLC index is because the per LLC-scope data
>> is needed to do cache aware load balancing. However, the native lld_id
>> is usually the first CPU of that LLC domain, which is not continuous,
>> which might waste the space if the per LLC-scope data is stored
>> in an array (in current implementation).
>>
>> In the future, this LLC index could be removed after
>> the native llc_id is used as the key to search into xarray based
>> array.
>>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>   include/linux/sched.h   |  3 +++
>>   kernel/sched/fair.c     | 12 ++++++++++++
>>   kernel/sched/sched.h    |  2 ++
>>   kernel/sched/topology.c | 29 +++++++++++++++++++++++++++++
>>   4 files changed, 46 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index d0e4cda2b3cd..7ce95a32e9ff 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -810,6 +810,9 @@ struct kmap_ctrl {
>>   #endif
>>   };
>> +/* XXX need fix to not use magic number */
>> +#define MAX_LLC 64
> 
> This number needs to be much higher. maybe keeping NR_CPUS wont hurt.
> 
It will be replaced by xarray, so above restriction might not be needed
anymore.

>>   }
>> +/*
>> + * continous index.
>> + * TBD: replace by xarray with key llc_id()
>> + */
>> +static inline int llc_idx(int cpu)
>> +{
>> +    if (cpu < 0)
>> +        return -1;
>> +
>> +    return per_cpu(sd_llc_idx, cpu);
>> +}
>> +
>>   void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu 
>> *_pcpu_sched)
>>   {
>>       unsigned long epoch;
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 1c6fd45c7f62..74eb2f3615aa 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2037,6 +2037,7 @@ static inline struct sched_domain 
>> *lowest_flag_domain(int cpu, int flag)
>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>   DECLARE_PER_CPU(int, sd_llc_size);
>>   DECLARE_PER_CPU(int, sd_llc_id);
>> +DECLARE_PER_CPU(int, sd_llc_idx);
>>   DECLARE_PER_CPU(int, sd_share_id);
>>   DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>   DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>> @@ -2045,6 +2046,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, 
>> sd_asym_cpucapacity);
>>   extern struct static_key_false sched_asym_cpucapacity;
>>   extern struct static_key_false sched_cluster_active;
>> +extern int max_llcs;
>>   static __always_inline bool sched_asym_cpucap_active(void)
>>   {
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index f1ebc60d967f..b7bb13045dd8 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -672,6 +672,7 @@ static void destroy_sched_domains(struct 
>> sched_domain *sd)
>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>   DEFINE_PER_CPU(int, sd_llc_size);
>>   DEFINE_PER_CPU(int, sd_llc_id);
>> +DEFINE_PER_CPU(int, sd_llc_idx);
>>   DEFINE_PER_CPU(int, sd_share_id);
>>   DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>> @@ -681,6 +682,25 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, 
>> sd_asym_cpucapacity);
>>   DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
>>   DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
>> +int max_llcs = -1;
>> +
>> +static void update_llc_idx(int cpu)
>> +{
>> +#ifdef CONFIG_SCHED_CACHE
>> +    int idx = -1, llc_id = -1;
>> +
>> +    llc_id = per_cpu(sd_llc_id, cpu);
>> +    idx = per_cpu(sd_llc_idx, llc_id);
>> +
>> +    if (idx < 0) {
>> +        idx = max_llcs++;
>> +        BUG_ON(idx > MAX_LLC); 
> 
> maybe a warning instead here?
> 

Ditto.


thanks,
Chenyu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-07-03 20:00   ` Shrikanth Hegde
@ 2025-07-04 10:09     ` Chen, Yu C
  0 siblings, 0 replies; 68+ messages in thread
From: Chen, Yu C @ 2025-07-04 10:09 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Vincent Guittot, Libo Chen,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy,
	K Prateek Nayak, Tim Chen

On 7/4/2025 4:00 AM, Shrikanth Hegde wrote:
> 
>>
>> tl;dr
>>
>> o Benchmark that prefer co-location and run in threaded mode see
>>    a benefit including hackbench at high utilization and schbench
>>    at low utilization.
>>
>> o schbench (both new and old but particularly the old) regresses
>>    quite a bit on the tial latency metric when #workers cross the
>>    LLC size.
>>
>> o client-server benchmarks where client and servers are threads
>>    from different processes (netserver-netperf, tbench_srv-tbench,
>>    services of DeathStarBench) seem to noticeably regress due to
>>    lack of co-location between the communicating client and server.
>>
>>    Not sure if WF_SYNC can be an indicator to temporarily ignore
>>    the preferred LLC hint.
>>
>> o stream regresses in some runs where the occupancy metrics trip
>>    and assign a preferred LLC for all the stream threads bringing
>>    down performance in !50% of the runs.
>>
> 
> - When you have SMT systems, threads will go faster if they run in ST mode.
> If aggregation happens in a LLC, they might end up with lower IPC.
> 

OK, the number of SMT within a core should also be considered to
control how aggressive the aggregation is.

Regarding the regression from the stream, it was caused by the working
set size. When the working set size is 2.9G in Prateek's test scenario,
there is a regression with task aggregation. If we reduce it to a lower
value, say 512MB, the regression disappears. Therefore, we are trying to
tweak this by comparing the process's RSS with the L3 cache size.

thanks,
Chenyu

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 10/20] sched: Calculate the number of tasks that have LLC preference on a runqueue
  2025-07-03 19:45   ` Shrikanth Hegde
@ 2025-07-04 15:00     ` Chen, Yu C
  0 siblings, 0 replies; 68+ messages in thread
From: Chen, Yu C @ 2025-07-04 15:00 UTC (permalink / raw)
  To: Shrikanth Hegde, Tim Chen
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy

On 7/4/2025 3:45 AM, Shrikanth Hegde wrote:
> 
> 
> On 6/18/25 23:57, Tim Chen wrote:
>> Track for each run queue, the number of tasks that have a LLC preference
>> and how many of those tasks are running in its preferred LLC.  This is
>> similar to nr_numa_running and nr_preferred_running for NUMA balance,
>> and will be used by the cache-aware load balancing in subsequent patches.
>>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>   kernel/sched/core.c  | 12 ++++++++++++
>>   kernel/sched/fair.c  | 42 +++++++++++++++++++++++++++++++++++++++++-
>>   kernel/sched/sched.h |  7 +++++++
>>   3 files changed, 60 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index d9c3e75f79d1..34056eb79ef2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -498,6 +498,18 @@ void __trace_set_current_state(int state_value)
>>   }
>>   EXPORT_SYMBOL(__trace_set_current_state);
>> +#ifdef CONFIG_SMP
> 
> 
> CONFIG_SMP is true unconditionally now. Else may need to go.
> 

OK. I suppose it will take effect from 6.17? We can remove this control
after rebasing to that version.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 14/20] sched: Introduce update_llc_busiest() to deal with groups having preferred LLC tasks
  2025-07-03 19:52   ` Shrikanth Hegde
@ 2025-07-05  2:26     ` Chen, Yu C
  0 siblings, 0 replies; 68+ messages in thread
From: Chen, Yu C @ 2025-07-05  2:26 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Tim Chen

On 7/4/2025 3:52 AM, Shrikanth Hegde wrote:
> 
> 
> On 6/18/25 23:58, Tim Chen wrote:
>> The load balancer attempts to identify the busiest sched_group with
>> the highest load and migrates some tasks to a less busy sched_group
>> to distribute the load across different CPUs.
>>
>> When cache-aware scheduling is enabled, the busiest sched_group is
>> defined as the one with the highest number of tasks preferring to run
>> on the destination LLC. If the busiest group has llc_balance tag,
>> the cache aware load balance will be launched.
>>
>> Introduce the helper function update_llc_busiest() to identify
>> such sched group with most tasks preferring the destination LLC.
>>
>> Co-developed-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>   kernel/sched/fair.c | 36 +++++++++++++++++++++++++++++++++++-
>>   1 file changed, 35 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 48a090c6e885..ab3d1239d6e4 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10848,12 +10848,36 @@ static inline bool llc_balance(struct lb_env 
>> *env, struct sg_lb_stats *sgs,
>>       return false;
>>   }
>> +
>> +static bool update_llc_busiest(struct lb_env *env,
>> +                   struct sg_lb_stats *busiest,
>> +                   struct sg_lb_stats *sgs)
>> +{
>> +    int idx;
>> +
>> +    /* Only the candidate with llc_balance need to be taken care of */
>> +    if (!sgs->group_llc_balance)
>> +        return false;
>> +
>> +    /*
>> +     * There are more tasks that want to run on dst_cpu's LLC.
>> +     */
>> +    idx = llc_idx(env->dst_cpu);
>> +    return sgs->nr_pref_llc[idx] > busiest->nr_pref_llc[idx];
>> +}
>>   #else
>>   static inline bool llc_balance(struct lb_env *env, struct 
>> sg_lb_stats *sgs,
>>                      struct sched_group *group)
>>   {
>>       return false;
>>   }
>> +
>> +static bool update_llc_busiest(struct lb_env *env,
>> +                   struct sg_lb_stats *busiest,
>> +                   struct sg_lb_stats *sgs)
>> +{
>> +    return false;
>> +}
>>   #endif
>>   static inline long sibling_imbalance(struct lb_env *env,
>> @@ -11085,6 +11109,14 @@ static bool update_sd_pick_busiest(struct 
>> lb_env *env,
>>            sds->local_stat.group_type != group_has_spare))
>>           return false;
>> +    /* deal with prefer LLC load balance, if failed, fall into normal 
>> load balance */
>> +    if (update_llc_busiest(env, busiest, sgs))
>> +        return true;
>> +
>> +    /* if there is already a busy group, skip the normal load balance */
>> +    if (busiest->group_llc_balance)
>> +        return false;
>> +
> 
> If you had a group which was group_overloaded but it could have 
> group_llc_balance right?

Yes.

> In this case the priorities based on group_type is not followed no?
> 

Currently, group_llc_balance appears to take precedence over the
normal group_type. The setting of group_llc_balance is determined by
_get_migrate_hint(). We've made efforts to set this flag carefully to
avoid disrupting the normal load balancing.

For example, group_llc_balance won't be enabled when both the destination
LLC and source LLC surpass 50% of the average utilization. As for
group_overloaded, its threshold is set at 85% utilization 
(imbalance_pct=117).
So in this case, the group_overloaded would be honored.

>>       if (sgs->group_type > busiest->group_type)
>>           return true;
>> @@ -11991,9 +12023,11 @@ static struct sched_group 
>> *sched_balance_find_src_group(struct lb_env *env)
>>       /*
>>        * Try to move all excess tasks to a sibling domain of the busiest
>>        * group's child domain.
>> +     * Also do so if we can move some tasks that prefer the local LLC.
>>        */
>>       if (sds.prefer_sibling && local->group_type == group_has_spare &&
>> -        sibling_imbalance(env, &sds, busiest, local) > 1)
>> +        (busiest->group_llc_balance ||
>> +        sibling_imbalance(env, &sds, busiest, local) > 1))
>>           goto force_balance;
>>       if (busiest->group_type != group_overloaded) {
> 
> Also, This load balancing happening due to llc could be very tricky to 
> debug.
> Any stats added to schedstat or sched/debug?

OK, we can add some in the next version.

Thanks,
Chenyu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 04/20] sched: Avoid calculating the cpumask if the system is overloaded
  2025-07-03 19:39   ` Shrikanth Hegde
@ 2025-07-07 14:57     ` Tim Chen
  0 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-07-07 14:57 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra, Ingo Molnar, K Prateek Nayak
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu, Gautham R . Shenoy

On Fri, 2025-07-04 at 01:09 +0530, Shrikanth Hegde wrote:
> 
> On 6/18/25 23:57, Tim Chen wrote:
> > From: K Prateek Nayak <kprateek.nayak@amd.com>
> > 
> > If the SIS_UTIL cuts off idle cpu search, result of the cpumask_and() is
> > of no use. Since select_idle_cpu() can now be called twice per wake up
> > in the select_idle_sibling() due to cache aware wake up, this overhead
> > can be visible in benchmarks like hackbench.
> > 
> > To save some additional cycles, especially in cases where we target
> > the LLC frequently and the search bails out because the LLC is busy,
> > only calculate the cpumask if the system is not overloaded.
> > 
> 
> This patch could be independent and should help in general.
> But changelog needs to be updated.
> 
> 

Yes, that makes sense.

Tim

> > Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > ---
> >   kernel/sched/fair.c | 4 ++--
> >   1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 567ad2a0cfa2..6a2678f9d44a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7918,8 +7918,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >   	int i, cpu, idle_cpu = -1, nr = INT_MAX;
> >   	struct sched_domain_shared *sd_share;
> >   
> > -	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > -
> >   	if (sched_feat(SIS_UTIL)) {
> >   		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> >   		if (sd_share) {
> > @@ -7931,6 +7929,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >   		}
> >   	}
> >   
> > +	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> > +
> >   	if (static_branch_unlikely(&sched_cluster_active)) {
> >   		struct sched_group *sg = sd->groups;
> >   
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 01/20] sched: Cache aware load-balancing
  2025-07-03 19:29   ` Shrikanth Hegde
  2025-07-04  8:40     ` Chen, Yu C
@ 2025-07-07 19:57     ` Tim Chen
  1 sibling, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-07-07 19:57 UTC (permalink / raw)
  To: Shrikanth Hegde, Chen Yu
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy

On Fri, 2025-07-04 at 00:59 +0530, Shrikanth Hegde wrote:
> 
> Hi Tim, Chen,
> skimming through the series and will try to go through in coming days.
> 
> > 

Thanks for taking a look.  Some further comments on top of Chen Yu's response.

[snip]
> >   
> > +#ifdef CONFIG_SCHED_CACHE
> > +	struct callback_head		cache_work;
> > +#endif
> > +
> >   #ifdef CONFIG_RSEQ
> >   	struct rseq __user *rseq;
> >   	u32 rseq_len;
> > diff --git a/init/Kconfig b/init/Kconfig
> > index bf3a920064be..e2509127b6f9 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -953,6 +953,10 @@ config NUMA_BALANCING
> >   
> >   	  This system will be inactive on UMA systems.
> >   
> > +config SCHED_CACHE
> > +	bool "Cache aware scheduler"
> > +	default y
> > +
> 
> Should it depend on EXPERT?
> IMO this could add quite a bit of overhead and maybe n by default?
> 

We do have a SCHED_CACHE scheduler feature in the later patches.  
So the feature could be turned on/off at run time by admin who don't want
to incur this overhead.

> >   config NUMA_BALANCING_DEFAULT_ENABLED
> >   	bool "Automatically enable NUMA aware memory/task placement"
> >   	default y
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 168681fc4b25..da1387823b9e 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1332,6 +1332,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> >   	if (mm_alloc_cid(mm, p))
> >   		goto fail_cid;
> >   

[snip]
> 
> > +
> > +static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
> > +{
> > +	lockdep_assert_held(&rq->cpu_epoch_lock);
> > +
> > +	unsigned long n, now = jiffies;
> > +	long delta = now - rq->cpu_epoch_next;
> > +
> > +	if (delta > 0) {
> > +		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
> > +		rq->cpu_epoch += n;
> > +		rq->cpu_epoch_next += n * EPOCH_PERIOD;
> > +		__shr_u64(&rq->cpu_runtime, n);
> 
> Another doubt i had, does this occupancy works when there is CPU bandwidth controller running?
> A 50% occupancy may have different meaning when CPU bandwidth is set to 50%?

The occupancy is used to compare tasks occupancy within a process.  With bandwidth
controller set to 50%, it just mean that all tasks in the process will run 50% less,
but the relative occupancy ratio between tasks should still remain the same.

[snip]
> 
> > +	}
> > 
> > +
> > +static void task_cache_work(struct callback_head *work)
> > +{
> > +	struct task_struct *p = current;
> > +	struct mm_struct *mm = p->mm;
> > +	unsigned long m_a_occ = 0;
> > +	int cpu, m_a_cpu = -1;
> > +	cpumask_var_t cpus;
> > +
> > +	WARN_ON_ONCE(work != &p->cache_work);
> > +
> > +	work->next = work;
> > +
> > +	if (p->flags & PF_EXITING)
> > +		return;
> > +
> > +	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
> > +		return;
> > +
> > +	scoped_guard (cpus_read_lock) {
> > +		cpumask_copy(cpus, cpu_online_mask);
> > +
> 
> As pointed out already, this is going to be costly in multi NUMA systems. Any cross NUMA access of
> CPUs data is going to add overhead to system bus bandwidth and this happening at tick could be costly.
> 

We'll consider restricting the scan on preferred NUMA node (if numa balancing is running), which
should greatly reduce the overhead. 

> Also, taking cpu_read_lock does preempt_disable, this could add to large preemptoff?
> We need to measure the time it takes on large system. Will try and get back with that number

Tim

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling
  2025-07-03 19:33   ` Shrikanth Hegde
@ 2025-07-07 21:02     ` Tim Chen
  0 siblings, 0 replies; 68+ messages in thread
From: Tim Chen @ 2025-07-07 21:02 UTC (permalink / raw)
  To: Shrikanth Hegde, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel

On Fri, 2025-07-04 at 01:03 +0530, Shrikanth Hegde wrote:
> 
> 
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 3c12d9f93331..d2af7bfd36bf 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -87,6 +87,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> >    */
> >   SCHED_FEAT(SIS_UTIL, true)
> >   
> > +SCHED_FEAT(SCHED_CACHE, true)
> 
> Having both SCHED_FEAT and CONFIG_SCHED_CACHE seems like overkill.
> Is it really necessary to have both?

As Peter pointed out previously, a runtime knob is still preferable.

> 
> Also, given the complexity it brings and only a workloads which spawns threads
> which have data sharing among them benefit, it could be false by default.

That's true. We'll try to address such cases in the next version with a
default behavior that's more conservative.

Tim
> 
> >   /*
> >    * Issue a WARN when we do multiple update_rq_clock() calls
> >    * in a single rq->lock section. Default disabled because the
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
  2025-06-18 18:27 ` [RFC patch v3 07/20] sched: Add helper function to decide whether to allow " Tim Chen
@ 2025-07-08  0:41   ` Libo Chen
  2025-07-08  8:29     ` Chen, Yu C
  2025-07-08 21:59     ` Tim Chen
  0 siblings, 2 replies; 68+ messages in thread
From: Libo Chen @ 2025-07-08  0:41 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu

Hi Tim and Chenyu,


On 6/18/25 11:27, Tim Chen wrote:
> Cache-aware scheduling is designed to aggregate threads into their
> preferred LLC, either via the task wake up path or the load balancing
> path. One side effect is that when the preferred LLC is saturated,
> more threads will continue to be stacked on it, degrading the workload's
> latency. A strategy is needed to prevent this aggregation from going too
> far such that the preferred LLC is too overloaded.
> 
> Introduce helper function _get_migrate_hint() to implement the LLC
> migration policy:
> 
> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>    are not too busy (<50% utilization, tunable), or the preferred
>    LLC will not be too out of balanced from the non preferred LLC
>    (>20% utilization, tunable, close to imbalance_pct of the LLC
>    domain).
> 2) Allow a task to be moved from the preferred LLC to the
>    non-preferred one if the non-preferred LLC will not be too out
>    of balanced from the preferred prompting an aggregation task
>    migration later.  We are still experimenting with the aggregation
>    and migration policy. Some other possibilities are policy based
>    on LLC's load or average number of tasks running.  Those could
>    be tried out by tweaking _get_migrate_hint().
> 
> The function _get_migrate_hint() returns migration suggestions for the upper-le
> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
> +


I think this patch has a great potential.

Since _get_migrate_hint() is tied to an individual task anyway, why not add a
per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
preferences for llc stacking, they can all be running in the same system at the
same time. This way you can offer a greater deal of optimization without much
burden to others.

Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE? Does setting
sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?

Thanks,
Libo

> +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
> +					   unsigned long tsk_util,
> +					   bool to_pref)
> +{
> +	unsigned long src_util, dst_util, src_cap, dst_cap;
> +
> +	if (cpus_share_cache(src_cpu, dst_cpu))
> +		return mig_allow;
> +
> +	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
> +	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
> +		return mig_allow;
> +
> +	if (!fits_llc_capacity(dst_util, dst_cap) &&
> +	    !fits_llc_capacity(src_util, src_cap))
> +		return mig_ignore;
> +
> +	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
> +	dst_util = dst_util + tsk_util;
> +	if (to_pref) {
> +		/*
> +		 * sysctl_llc_aggr_imb is the imbalance allowed between
> +		 * preferred LLC and non-preferred LLC.
> +		 * Don't migrate if we will get preferred LLC too
> +		 * heavily loaded and if the dest is much busier
> +		 * than the src, in which case migration will
> +		 * increase the imbalance too much.
> +		 */
> +		if (!fits_llc_capacity(dst_util, dst_cap) &&
> +		    util_greater(dst_util, src_util))
> +			return mig_forbid;
> +	} else {
> +		/*
> +		 * Don't migrate if we will leave preferred LLC
> +		 * too idle, or if this migration leads to the
> +		 * non-preferred LLC falls within sysctl_aggr_imb percent
> +		 * of preferred LLC, leading to migration again
> +		 * back to preferred LLC.
> +		 */
> +		if (fits_llc_capacity(src_util, src_cap) ||
> +		    !util_greater(src_util, dst_util))
> +			return mig_forbid;
> +	}
> +	return mig_allow;
> +}



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling
  2025-06-18 18:27 ` [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling Tim Chen
  2025-07-03 19:33   ` Shrikanth Hegde
@ 2025-07-08  1:15   ` Libo Chen
  2025-07-08  7:54     ` Chen, Yu C
  1 sibling, 1 reply; 68+ messages in thread
From: Libo Chen @ 2025-07-08  1:15 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel

Hi Chenyu

On 6/18/25 11:27, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
> 
> 1. Fix compile error on percpu allocation.
> 2. Enqueue to the target CPU rather than the current CPU.
> 3. NULL LLC sched domain check(Libo Chen).

Can I suggest we completely disable cache-aware scheduling
for systems without any LLC in the next version? No more added
fields, function code for them. This info should be easily
determinable during bootup while building up the topology,
and cannot be modified during runtime. Sometimes it's not
possible for distros to disable it in kconfig just for one
particular CPU, and SCHED_CACHE_LB isn't enough for removing
the added fields and users can turn it back on anyway.

Thanks,
Libo


> 4. Introduce sched feature SCHED_CACHE to control cache aware scheduling
> 5. Fix unsigned occupancy initialization to -1.
> 6. If there is only 1 thread in the process, no need to enable cache
>    awareness
> 7. Add __maybe_unused to __migrate_degrades_locality() to
>    avoid compile warnings.
> 
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  include/linux/mm_types.h |  4 ++--
>  kernel/sched/fair.c      | 27 ++++++++++++++++-----------
>  kernel/sched/features.h  |  1 +
>  3 files changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 013291c6aaa2..9de4a0a13c4d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1411,11 +1411,11 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
>  #endif /* CONFIG_SCHED_MM_CID */
>  
>  #ifdef CONFIG_SCHED_CACHE
> -extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sched);
> +extern void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
>  
>  static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
>  {
> -	struct mm_sched *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
> +	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
>  	if (!pcpu_sched)
>  		return -ENOMEM;
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df7d4a324fbe..89db97f8ef02 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1175,7 +1175,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
>  #define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
>  #define EPOCH_OLD	5		/* 50 ms */
>  
> -void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
> +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
>  {
>  	unsigned long epoch;
>  	int i;
> @@ -1186,7 +1186,7 @@ void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
>  
>  		pcpu_sched->runtime = 0;
>  		pcpu_sched->epoch = epoch = rq->cpu_epoch;
> -		pcpu_sched->occ = -1;
> +		pcpu_sched->occ = 0;
>  	}
>  
>  	raw_spin_lock_init(&mm->mm_sched_lock);
> @@ -1254,7 +1254,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>  	if (!mm || !mm->pcpu_sched)
>  		return;
>  
> -	pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
> +	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
>  
>  	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
>  		__update_mm_sched(rq, pcpu_sched);
> @@ -1264,12 +1264,14 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>  	}
>  
>  	/*
> -	 * If this task hasn't hit task_cache_work() for a while, invalidate
> +	 * If this task hasn't hit task_cache_work() for a while, or it
> +	 * has only 1 thread, invalidate
>  	 * it's preferred state.
>  	 */
> -	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
> +	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD ||
> +	    get_nr_threads(p) <= 1) {
>  		mm->mm_sched_cpu = -1;
> -		pcpu_sched->occ = -1;
> +		pcpu_sched->occ = 0;
>  	}
>  }
>  
> @@ -1286,9 +1288,6 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
>  
>  	guard(raw_spinlock)(&mm->mm_sched_lock);
>  
> -	if (mm->mm_sched_epoch == rq->cpu_epoch)
> -		return;
> -
>  	if (work->next == work) {
>  		task_work_add(p, work, TWA_RESUME);
>  		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
> @@ -1322,6 +1321,9 @@ static void task_cache_work(struct callback_head *work)
>  			unsigned long occ, m_occ = 0, a_occ = 0;
>  			int m_cpu = -1, nr = 0, i;
>  
> +			if (!sd)
> +				continue;
> +
>  			for_each_cpu(i, sched_domain_span(sd)) {
>  				occ = fraction_mm_sched(cpu_rq(i),
>  							per_cpu_ptr(mm->pcpu_sched, i));
> @@ -8801,6 +8803,9 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
>  	struct mm_struct *mm = p->mm;
>  	int cpu;
>  
> +	if (!sched_feat(SCHED_CACHE))
> +		return prev_cpu;
> +
>  	if (!mm || p->nr_cpus_allowed == 1)
>  		return prev_cpu;
>  
> @@ -9555,7 +9560,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>  		return 0;
>  
>  #ifdef CONFIG_SCHED_CACHE
> -	if (p->mm && p->mm->pcpu_sched) {
> +	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
>  		/*
>  		 * XXX things like Skylake have non-inclusive L3 and might not
>  		 * like this L3 centric view. What to do about L2 stickyness ?
> @@ -9633,7 +9638,7 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
>  }
>  
>  #else
> -static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
> +static __maybe_unused long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
>  {
>  	return 0;
>  }
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 3c12d9f93331..d2af7bfd36bf 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -87,6 +87,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_UTIL, true)
>  
> +SCHED_FEAT(SCHED_CACHE, true)
>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls
>   * in a single rq->lock section. Default disabled because the


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling
  2025-07-08  1:15   ` Libo Chen
@ 2025-07-08  7:54     ` Chen, Yu C
  2025-07-08 15:47       ` Libo Chen
  0 siblings, 1 reply; 68+ messages in thread
From: Chen, Yu C @ 2025-07-08  7:54 UTC (permalink / raw)
  To: Libo Chen
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Tim Chen, Ingo Molnar, K Prateek Nayak,
	Peter Zijlstra, Gautham R . Shenoy

On 7/8/2025 9:15 AM, Libo Chen wrote:
> Hi Chenyu
> 
> On 6/18/25 11:27, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> 1. Fix compile error on percpu allocation.
>> 2. Enqueue to the target CPU rather than the current CPU.
>> 3. NULL LLC sched domain check(Libo Chen).
> 
> Can I suggest we completely disable cache-aware scheduling
> for systems without any LLC in the next version? No more added
> fields, function code for them. This info should be easily
> determinable during bootup while building up the topology,
> and cannot be modified during runtime. Sometimes it's not
> possible for distros to disable it in kconfig just for one
> particular CPU, and SCHED_CACHE_LB isn't enough for removing
> the added fields and users can turn it back on anyway.
> 

Good point, my understanding is that we should introduce
a static key similar to sched_smt_present to get rid of the
cache-aware scheduling code path if either LLC is not present
or there is only 1 LLC within the Node.

Thanks,
Chenyu

> Thanks,
> Libo
> 
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
  2025-07-08  0:41   ` Libo Chen
@ 2025-07-08  8:29     ` Chen, Yu C
  2025-07-08 17:22       ` Libo Chen
  2025-07-08 21:59     ` Tim Chen
  1 sibling, 1 reply; 68+ messages in thread
From: Chen, Yu C @ 2025-07-08  8:29 UTC (permalink / raw)
  To: Libo Chen
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy

On 7/8/2025 8:41 AM, Libo Chen wrote:
> Hi Tim and Chenyu,
> 
> 
> On 6/18/25 11:27, Tim Chen wrote:
>> Cache-aware scheduling is designed to aggregate threads into their
>> preferred LLC, either via the task wake up path or the load balancing
>> path. One side effect is that when the preferred LLC is saturated,
>> more threads will continue to be stacked on it, degrading the workload's
>> latency. A strategy is needed to prevent this aggregation from going too
>> far such that the preferred LLC is too overloaded.
>>
>> Introduce helper function _get_migrate_hint() to implement the LLC
>> migration policy:
>>
>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>     are not too busy (<50% utilization, tunable), or the preferred
>>     LLC will not be too out of balanced from the non preferred LLC
>>     (>20% utilization, tunable, close to imbalance_pct of the LLC
>>     domain).
>> 2) Allow a task to be moved from the preferred LLC to the
>>     non-preferred one if the non-preferred LLC will not be too out
>>     of balanced from the preferred prompting an aggregation task
>>     migration later.  We are still experimenting with the aggregation
>>     and migration policy. Some other possibilities are policy based
>>     on LLC's load or average number of tasks running.  Those could
>>     be tried out by tweaking _get_migrate_hint().
>>
>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>> +
> 
> 
> I think this patch has a great potential.
> 
> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
> per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
> preferences for llc stacking, they can all be running in the same system at the
> same time. This way you can offer a greater deal of optimization without much
> burden to others.

Yes, this doable. It can be evaluated after the global generic strategy
has been verified to work, like NUMA balancing :)

> 
> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE? 

Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?

> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
> 

My understanding is that, if sysctl_llc_aggr_imb is 0, the task aggregation
might still consider other aspects, like if that target LLC's 
utilization has
exceeded 50% or not.

thanks,
Chenyu> Thanks,
> Libo
> 
>> +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
>> +					   unsigned long tsk_util,
>> +					   bool to_pref)
>> +{
>> +	unsigned long src_util, dst_util, src_cap, dst_cap;
>> +
>> +	if (cpus_share_cache(src_cpu, dst_cpu))
>> +		return mig_allow;
>> +
>> +	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
>> +	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
>> +		return mig_allow;
>> +
>> +	if (!fits_llc_capacity(dst_util, dst_cap) &&
>> +	    !fits_llc_capacity(src_util, src_cap))
>> +		return mig_ignore;
>> +
>> +	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
>> +	dst_util = dst_util + tsk_util;
>> +	if (to_pref) {
>> +		/*
>> +		 * sysctl_llc_aggr_imb is the imbalance allowed between
>> +		 * preferred LLC and non-preferred LLC.
>> +		 * Don't migrate if we will get preferred LLC too
>> +		 * heavily loaded and if the dest is much busier
>> +		 * than the src, in which case migration will
>> +		 * increase the imbalance too much.
>> +		 */
>> +		if (!fits_llc_capacity(dst_util, dst_cap) &&
>> +		    util_greater(dst_util, src_util))
>> +			return mig_forbid;
>> +	} else {
>> +		/*
>> +		 * Don't migrate if we will leave preferred LLC
>> +		 * too idle, or if this migration leads to the
>> +		 * non-preferred LLC falls within sysctl_aggr_imb percent
>> +		 * of preferred LLC, leading to migration again
>> +		 * back to preferred LLC.
>> +		 */
>> +		if (fits_llc_capacity(src_util, src_cap) ||
>> +		    !util_greater(src_util, dst_util))
>> +			return mig_forbid;
>> +	}
>> +	return mig_allow;
>> +}
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling
  2025-07-08  7:54     ` Chen, Yu C
@ 2025-07-08 15:47       ` Libo Chen
  0 siblings, 0 replies; 68+ messages in thread
From: Libo Chen @ 2025-07-08 15:47 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Tim Chen, Ingo Molnar, K Prateek Nayak,
	Peter Zijlstra, Gautham R . Shenoy



On 7/8/25 00:54, Chen, Yu C wrote:
> On 7/8/2025 9:15 AM, Libo Chen wrote:
>> Hi Chenyu
>>
>> On 6/18/25 11:27, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@intel.com>
>>>
>>> 1. Fix compile error on percpu allocation.
>>> 2. Enqueue to the target CPU rather than the current CPU.
>>> 3. NULL LLC sched domain check(Libo Chen).
>>
>> Can I suggest we completely disable cache-aware scheduling
>> for systems without any LLC in the next version? No more added
>> fields, function code for them. This info should be easily
>> determinable during bootup while building up the topology,
>> and cannot be modified during runtime. Sometimes it's not
>> possible for distros to disable it in kconfig just for one
>> particular CPU, and SCHED_CACHE_LB isn't enough for removing
>> the added fields and users can turn it back on anyway.
>>
> 
> Good point, my understanding is that we should introduce
> a static key similar to sched_smt_present to get rid of the
Exactly!
> cache-aware scheduling code path if either LLC is not present
> or there is only 1 LLC within the Node.
> 
> Thanks,
> Chenyu
> 
>> Thanks,
>> Libo
>>
>>
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
  2025-07-08  8:29     ` Chen, Yu C
@ 2025-07-08 17:22       ` Libo Chen
  2025-07-09 14:41         ` Chen, Yu C
  0 siblings, 1 reply; 68+ messages in thread
From: Libo Chen @ 2025-07-08 17:22 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy



On 7/8/25 01:29, Chen, Yu C wrote:
> On 7/8/2025 8:41 AM, Libo Chen wrote:
>> Hi Tim and Chenyu,
>>
>>
>> On 6/18/25 11:27, Tim Chen wrote:
>>> Cache-aware scheduling is designed to aggregate threads into their
>>> preferred LLC, either via the task wake up path or the load balancing
>>> path. One side effect is that when the preferred LLC is saturated,
>>> more threads will continue to be stacked on it, degrading the workload's
>>> latency. A strategy is needed to prevent this aggregation from going too
>>> far such that the preferred LLC is too overloaded.
>>>
>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>> migration policy:
>>>
>>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>>     are not too busy (<50% utilization, tunable), or the preferred
>>>     LLC will not be too out of balanced from the non preferred LLC
>>>     (>20% utilization, tunable, close to imbalance_pct of the LLC
>>>     domain).
>>> 2) Allow a task to be moved from the preferred LLC to the
>>>     non-preferred one if the non-preferred LLC will not be too out
>>>     of balanced from the preferred prompting an aggregation task
>>>     migration later.  We are still experimenting with the aggregation
>>>     and migration policy. Some other possibilities are policy based
>>>     on LLC's load or average number of tasks running.  Those could
>>>     be tried out by tweaking _get_migrate_hint().
>>>
>>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>>> +
>>
>>
>> I think this patch has a great potential.
>>
>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>> per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
>> preferences for llc stacking, they can all be running in the same system at the
>> same time. This way you can offer a greater deal of optimization without much
>> burden to others.
> 
> Yes, this doable. It can be evaluated after the global generic strategy
> has been verified to work, like NUMA balancing :)
> 

I will run some real-world workloads and get back to you (may take some time)

>>
>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE? 
> 
> Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?
> 

Ah I was thinking sysctl_llc_aggr_imb alone can help reduce overstacking on
target LLC from a few hyperactive wakees (may consider to ratelimit those
wakees as a solution), but just realize this can affect lb as well and doesn't
really reduce overheads from frequent wakeups (no good idea on top of my head
but we should find a better solution than sched_feat to address the overhead issue).



>> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
>>
> 
> My understanding is that, if sysctl_llc_aggr_imb is 0, the task aggregation
> might still consider other aspects, like if that target LLC's utilization has
> exceeded 50% or not.
> 

which can be controlled by sysctl_llc_aggr_cap, right? Okay so if both LLCs have
<$(sysctl_llc_aggr_cap)% utilization, should sysctl_llc_aggr_cap be the only
determining factor here barring NUMA balancing?

Libo

> thanks,
> Chenyu> Thanks,
>> Libo
>>
>>> +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
>>> +                       unsigned long tsk_util,
>>> +                       bool to_pref)
>>> +{
>>> +    unsigned long src_util, dst_util, src_cap, dst_cap;
>>> +
>>> +    if (cpus_share_cache(src_cpu, dst_cpu))
>>> +        return mig_allow;
>>> +
>>> +    if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
>>> +        !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
>>> +        return mig_allow;
>>> +
>>> +    if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +        !fits_llc_capacity(src_util, src_cap))
>>> +        return mig_ignore;
>>> +
>>> +    src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
>>> +    dst_util = dst_util + tsk_util;
>>> +    if (to_pref) {
>>> +        /*
>>> +         * sysctl_llc_aggr_imb is the imbalance allowed between
>>> +         * preferred LLC and non-preferred LLC.
>>> +         * Don't migrate if we will get preferred LLC too
>>> +         * heavily loaded and if the dest is much busier
>>> +         * than the src, in which case migration will
>>> +         * increase the imbalance too much.
>>> +         */
>>> +        if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +            util_greater(dst_util, src_util))
>>> +            return mig_forbid;
>>> +    } else {
>>> +        /*
>>> +         * Don't migrate if we will leave preferred LLC
>>> +         * too idle, or if this migration leads to the
>>> +         * non-preferred LLC falls within sysctl_aggr_imb percent
>>> +         * of preferred LLC, leading to migration again
>>> +         * back to preferred LLC.
>>> +         */
>>> +        if (fits_llc_capacity(src_util, src_cap) ||
>>> +            !util_greater(src_util, dst_util))
>>> +            return mig_forbid;
>>> +    }
>>> +    return mig_allow;
>>> +}
>>
>>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
  2025-07-08  0:41   ` Libo Chen
  2025-07-08  8:29     ` Chen, Yu C
@ 2025-07-08 21:59     ` Tim Chen
  2025-07-09 21:22       ` Libo Chen
  1 sibling, 1 reply; 68+ messages in thread
From: Tim Chen @ 2025-07-08 21:59 UTC (permalink / raw)
  To: Libo Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu

On Mon, 2025-07-07 at 17:41 -0700, Libo Chen wrote:
> Hi Tim and Chenyu,
> 
> 
> On 6/18/25 11:27, Tim Chen wrote:
> > Cache-aware scheduling is designed to aggregate threads into their
> > preferred LLC, either via the task wake up path or the load balancing
> > path. One side effect is that when the preferred LLC is saturated,
> > more threads will continue to be stacked on it, degrading the workload's
> > latency. A strategy is needed to prevent this aggregation from going too
> > far such that the preferred LLC is too overloaded.
> > 
> > Introduce helper function _get_migrate_hint() to implement the LLC
> > migration policy:
> > 
> > 1) A task is aggregated to its preferred LLC if both source/dest LLC
> >    are not too busy (<50% utilization, tunable), or the preferred
> >    LLC will not be too out of balanced from the non preferred LLC
> >    (>20% utilization, tunable, close to imbalance_pct of the LLC
> >    domain).
> > 2) Allow a task to be moved from the preferred LLC to the
> >    non-preferred one if the non-preferred LLC will not be too out
> >    of balanced from the preferred prompting an aggregation task
> >    migration later.  We are still experimenting with the aggregation
> >    and migration policy. Some other possibilities are policy based
> >    on LLC's load or average number of tasks running.  Those could
> >    be tried out by tweaking _get_migrate_hint().
> > 
> > The function _get_migrate_hint() returns migration suggestions for the upper-le
> > +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
> > +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
> > +
> 
> 
> I think this patch has a great potential.
> 

Thanks for taking a look.

> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
> per-task llc_aggr_imb which defaults to the sysctl one? 
> 

_get_migrate_hint() could also be called from llc_balance(). At that time
we make a determination of whether we should do llc_balance() without knowing
which exact task we're going to move, but still observe the migration policy
that shouldn't cause too much imbalance.  So it may not be strictly tied to a task
in the current implementation.

> Tasks have different
> preferences for llc stacking, they can all be running in the same system at the
> same time. This way you can offer a greater deal of optimization without much
> burden to others.

You're thinking of something like a prctl knob that will bias aggregation for
some process?  Wonder if Peter has some opinion on this.

> 
> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
> 

Actually we think that we can do without SCHED_CACHE_WAKE feature and rely only
on load balance SCHED_CACHE_LB.  But still keeping 

>  Does setting
> sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?

Aggregation will tend to make utilization on the preferred LLC to be more
than the non-preferred one.  Parameter "sysctl_llc_aggr_imb" is the imbalance
allowed.  If we set this to 0, as long as the preferred LLC is not utilized
more than the source LLC, we could still aggregate towards the preferred LLC
and a preference could still be there.  

Tim

> 
> Thanks,
> Libo
> 
> > +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
> > +					   unsigned long tsk_util,
> > +					   bool to_pref)
> > +{
> > +	unsigned long src_util, dst_util, src_cap, dst_cap;
> > +
> > +	if (cpus_share_cache(src_cpu, dst_cpu))
> > +		return mig_allow;
> > +
> > +	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
> > +	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
> > +		return mig_allow;
> > +
> > +	if (!fits_llc_capacity(dst_util, dst_cap) &&
> > +	    !fits_llc_capacity(src_util, src_cap))
> > +		return mig_ignore;
> > +
> > +	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
> > +	dst_util = dst_util + tsk_util;
> > +	if (to_pref) {
> > +		/*
> > +		 * sysctl_llc_aggr_imb is the imbalance allowed between
> > +		 * preferred LLC and non-preferred LLC.
> > +		 * Don't migrate if we will get preferred LLC too
> > +		 * heavily loaded and if the dest is much busier
> > +		 * than the src, in which case migration will
> > +		 * increase the imbalance too much.
> > +		 */
> > +		if (!fits_llc_capacity(dst_util, dst_cap) &&
> > +		    util_greater(dst_util, src_util))
> > +			return mig_forbid;
> > +	} else {
> > +		/*
> > +		 * Don't migrate if we will leave preferred LLC
> > +		 * too idle, or if this migration leads to the
> > +		 * non-preferred LLC falls within sysctl_aggr_imb percent
> > +		 * of preferred LLC, leading to migration again
> > +		 * back to preferred LLC.
> > +		 */
> > +		if (fits_llc_capacity(src_util, src_cap) ||
> > +		    !util_greater(src_util, dst_util))
> > +			return mig_forbid;
> > +	}
> > +	return mig_allow;
> > +}
> 
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
  2025-07-08 17:22       ` Libo Chen
@ 2025-07-09 14:41         ` Chen, Yu C
  2025-07-09 21:31           ` Libo Chen
  0 siblings, 1 reply; 68+ messages in thread
From: Chen, Yu C @ 2025-07-09 14:41 UTC (permalink / raw)
  To: Libo Chen
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy

On 7/9/2025 1:22 AM, Libo Chen wrote:
> 
> 
> On 7/8/25 01:29, Chen, Yu C wrote:
>> On 7/8/2025 8:41 AM, Libo Chen wrote:
>>> Hi Tim and Chenyu,
>>>
>>>
>>> On 6/18/25 11:27, Tim Chen wrote:
>>>> Cache-aware scheduling is designed to aggregate threads into their
>>>> preferred LLC, either via the task wake up path or the load balancing
>>>> path. One side effect is that when the preferred LLC is saturated,
>>>> more threads will continue to be stacked on it, degrading the workload's
>>>> latency. A strategy is needed to prevent this aggregation from going too
>>>> far such that the preferred LLC is too overloaded.
>>>>
>>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>>> migration policy:
>>>>
>>>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>>>      are not too busy (<50% utilization, tunable), or the preferred
>>>>      LLC will not be too out of balanced from the non preferred LLC
>>>>      (>20% utilization, tunable, close to imbalance_pct of the LLC
>>>>      domain).
>>>> 2) Allow a task to be moved from the preferred LLC to the
>>>>      non-preferred one if the non-preferred LLC will not be too out
>>>>      of balanced from the preferred prompting an aggregation task
>>>>      migration later.  We are still experimenting with the aggregation
>>>>      and migration policy. Some other possibilities are policy based
>>>>      on LLC's load or average number of tasks running.  Those could
>>>>      be tried out by tweaking _get_migrate_hint().
>>>>
>>>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>>>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>>>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>>>> +
>>>
>>>
>>> I think this patch has a great potential.
>>>
>>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>>> per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
>>> preferences for llc stacking, they can all be running in the same system at the
>>> same time. This way you can offer a greater deal of optimization without much
>>> burden to others.
>>
>> Yes, this doable. It can be evaluated after the global generic strategy
>> has been verified to work, like NUMA balancing :)
>>
> 
> I will run some real-world workloads and get back to you (may take some time)
> 

Thanks. It seems that there are pros and cons for different
workloads and we are evaluating adding the RSS/active nr_running
per process to deal with different type of workloads.

>>>
>>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
>>
>> Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?
>>
> 
> Ah I was thinking sysctl_llc_aggr_imb alone can help reduce overstacking on
> target LLC from a few hyperactive wakees (may consider to ratelimit those
> wakees as a solution), but just realize this can affect lb as well and doesn't
> really reduce overheads from frequent wakeups (no good idea on top of my head
> but we should find a better solution than sched_feat to address the overhead issue).
> 
> 
> 
>>> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
>>>
>>
>> My understanding is that, if sysctl_llc_aggr_imb is 0, the task aggregation
>> might still consider other aspects, like if that target LLC's utilization has
>> exceeded 50% or not.
>>
> 
> which can be controlled by sysctl_llc_aggr_cap, right? Okay so if both LLCs have
> <$(sysctl_llc_aggr_cap)% utilization, should sysctl_llc_aggr_cap be the only
> determining factor here barring NUMA balancing?
> 

If both LLC are under (sysctl_llc_aggr_cap)%, then the strategy is still 
to allow
task to be aggregated to its preferred LLC, by either asking the task to 
not be
pulled out of its preferred LLC, or migrate task to its preferred LLC,
in _get_migrate_hint().

Thanks,
Chenyu

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
                   ` (22 preceding siblings ...)
  2025-06-24  5:00 ` K Prateek Nayak
@ 2025-07-09 19:39 ` Madadi Vineeth Reddy
  2025-07-10  3:33   ` Chen, Yu C
  23 siblings, 1 reply; 68+ messages in thread
From: Madadi Vineeth Reddy @ 2025-07-09 19:39 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, Len Brown, linux-kernel,
	Chen Yu, Madadi Vineeth Reddy

On 18/06/25 23:57, Tim Chen wrote:
> This is the third revision of the cache aware scheduling patches,
> based on the original patch proposed by Peter[1].
>  
> The goal of the patch series is to aggregate tasks sharing data
> to the same cache domain, thereby reducing cache bouncing and
> cache misses, and improve data access efficiency. In the current
> implementation, threads within the same process are considered
> as entities that potentially share resources.

[..snip..]

> 
> Comments and tests are much appreciated.

When running ebizzy as below:
ebizzy -t 8 -S 10

I see ~24% degradation on the patched kernel, due to higher SMT2 and
SMT4 cycles compared to the baseline. ST cycles decreased.

Since both P10 and P11 have LLC shared at the SMT4 level, even spawning
fewer threads easily crowds the LLC with the default llc_aggr_cap value
of 50. Increasing this value would likely make things worse, while
decreasing it to 25 effectively disables cache-aware scheduling
(as it limits selection to just one CPU).

I understand that ebizzy itself doesn't benefit from cache sharing, so
it might not improve but here it actually *regresses*, and the impact
may be even larger on P10 /P11 because of its smaller LLC shared by 4
CPUs, even with fewer threads. IPC drops.

By default, the SCHED_CACHE feature is enabled. Given these results for
workloads that don't share cache and on systems with smaller LLCs, I think
the default value should be revisited.

Thanks,
Madadi Vineeth Reddy

> 
> [1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> 
> The patches are grouped as follow:
> Patch 1:     Peter's original patch.
> Patch 2-5:   Various fixes and tuning of the original v1 patch.
> Patch 6-12:  Infrastructure and helper functions for load balancing to be cache aware.
> Patch 13-18: Add logic to load balancing for preferred LLC aggregation.
> Patch 19:    Add process LLC aggregation in load balancing sched feature.
> Patch 20:    Add Process LLC aggregation in wake up sched feature (turn off by default).
> 
> v1:
> https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> v2:
> https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
> 
> 
> Chen Yu (3):
>   sched: Several fixes for cache aware scheduling
>   sched: Avoid task migration within its preferred LLC
>   sched: Save the per LLC utilization for better cache aware scheduling
> 
> K Prateek Nayak (1):
>   sched: Avoid calculating the cpumask if the system is overloaded
> 
> Peter Zijlstra (1):
>   sched: Cache aware load-balancing
> 
> Tim Chen (15):
>   sched: Add hysteresis to switch a task's preferred LLC
>   sched: Add helper function to decide whether to allow cache aware
>     scheduling
>   sched: Set up LLC indexing
>   sched: Introduce task preferred LLC field
>   sched: Calculate the number of tasks that have LLC preference on a
>     runqueue
>   sched: Introduce per runqueue task LLC preference counter
>   sched: Calculate the total number of preferred LLC tasks during load
>     balance
>   sched: Tag the sched group as llc_balance if it has tasks prefer other
>     LLC
>   sched: Introduce update_llc_busiest() to deal with groups having
>     preferred LLC tasks
>   sched: Introduce a new migration_type to track the preferred LLC load
>     balance
>   sched: Consider LLC locality for active balance
>   sched: Consider LLC preference when picking tasks from busiest queue
>   sched: Do not migrate task if it is moving out of its preferred LLC
>   sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>   sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>     up
> 
>  include/linux/mm_types.h       |  44 ++
>  include/linux/sched.h          |   8 +
>  include/linux/sched/topology.h |   3 +
>  init/Kconfig                   |   4 +
>  init/init_task.c               |   3 +
>  kernel/fork.c                  |   5 +
>  kernel/sched/core.c            |  25 +-
>  kernel/sched/debug.c           |   4 +
>  kernel/sched/fair.c            | 859 ++++++++++++++++++++++++++++++++-
>  kernel/sched/features.h        |   3 +
>  kernel/sched/sched.h           |  23 +
>  kernel/sched/topology.c        |  29 ++
>  12 files changed, 982 insertions(+), 28 deletions(-)
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
  2025-07-08 21:59     ` Tim Chen
@ 2025-07-09 21:22       ` Libo Chen
  0 siblings, 0 replies; 68+ messages in thread
From: Libo Chen @ 2025-07-09 21:22 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Chen Yu



On 7/8/25 14:59, Tim Chen wrote:
> On Mon, 2025-07-07 at 17:41 -0700, Libo Chen wrote:
>> Hi Tim and Chenyu,
>>
>>
>> On 6/18/25 11:27, Tim Chen wrote:
>>> Cache-aware scheduling is designed to aggregate threads into their
>>> preferred LLC, either via the task wake up path or the load balancing
>>> path. One side effect is that when the preferred LLC is saturated,
>>> more threads will continue to be stacked on it, degrading the workload's
>>> latency. A strategy is needed to prevent this aggregation from going too
>>> far such that the preferred LLC is too overloaded.
>>>
>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>> migration policy:
>>>
>>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>>    are not too busy (<50% utilization, tunable), or the preferred
>>>    LLC will not be too out of balanced from the non preferred LLC
>>>    (>20% utilization, tunable, close to imbalance_pct of the LLC
>>>    domain).
>>> 2) Allow a task to be moved from the preferred LLC to the
>>>    non-preferred one if the non-preferred LLC will not be too out
>>>    of balanced from the preferred prompting an aggregation task
>>>    migration later.  We are still experimenting with the aggregation
>>>    and migration policy. Some other possibilities are policy based
>>>    on LLC's load or average number of tasks running.  Those could
>>>    be tried out by tweaking _get_migrate_hint().
>>>
>>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>>> +
>>
>>
>> I think this patch has a great potential.
>>
> 
> Thanks for taking a look.
> 
>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>> per-task llc_aggr_imb which defaults to the sysctl one? 
>>
> 
> _get_migrate_hint() could also be called from llc_balance(). At that time
> we make a determination of whether we should do llc_balance() without knowing
> which exact task we're going to move, but still observe the migration policy
> that shouldn't cause too much imbalance.  So it may not be strictly tied to a task
> in the current implementation.
> 
Ah right, by setting task_util to 0

>> Tasks have different
>> preferences for llc stacking, they can all be running in the same system at the
>> same time. This way you can offer a greater deal of optimization without much
>> burden to others.
> 
> You're thinking of something like a prctl knob that will bias aggregation for
> some process?  Wonder if Peter has some opinion on this.
> 

Yes. I am sure he has hhh but we can wait until the global approach is good enough
like Chen Yu said.

>>
>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
>>
> 
> Actually we think that we can do without SCHED_CACHE_WAKE feature and rely only
> on load balance SCHED_CACHE_LB.  But still keeping 
> 
>>  Does setting
>> sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
> 
> Aggregation will tend to make utilization on the preferred LLC to be more
> than the non-preferred one.  Parameter "sysctl_llc_aggr_imb" is the imbalance
> allowed.  If we set this to 0, as long as the preferred LLC is not utilized
> more than the source LLC, we could still aggregate towards the preferred LLC
> and a preference could still be there.  
> 

I see, I think I have better understanding of this now. Thanks!

Libo

> Tim
> 
>>
>> Thanks,
>> Libo
>>
>>> +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
>>> +					   unsigned long tsk_util,
>>> +					   bool to_pref)
>>> +{
>>> +	unsigned long src_util, dst_util, src_cap, dst_cap;
>>> +
>>> +	if (cpus_share_cache(src_cpu, dst_cpu))
>>> +		return mig_allow;
>>> +
>>> +	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
>>> +	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
>>> +		return mig_allow;
>>> +
>>> +	if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +	    !fits_llc_capacity(src_util, src_cap))
>>> +		return mig_ignore;
>>> +
>>> +	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
>>> +	dst_util = dst_util + tsk_util;
>>> +	if (to_pref) {
>>> +		/*
>>> +		 * sysctl_llc_aggr_imb is the imbalance allowed between
>>> +		 * preferred LLC and non-preferred LLC.
>>> +		 * Don't migrate if we will get preferred LLC too
>>> +		 * heavily loaded and if the dest is much busier
>>> +		 * than the src, in which case migration will
>>> +		 * increase the imbalance too much.
>>> +		 */
>>> +		if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +		    util_greater(dst_util, src_util))
>>> +			return mig_forbid;
>>> +	} else {
>>> +		/*
>>> +		 * Don't migrate if we will leave preferred LLC
>>> +		 * too idle, or if this migration leads to the
>>> +		 * non-preferred LLC falls within sysctl_aggr_imb percent
>>> +		 * of preferred LLC, leading to migration again
>>> +		 * back to preferred LLC.
>>> +		 */
>>> +		if (fits_llc_capacity(src_util, src_cap) ||
>>> +		    !util_greater(src_util, dst_util))
>>> +			return mig_forbid;
>>> +	}
>>> +	return mig_allow;
>>> +}
>>
>>
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
  2025-07-09 14:41         ` Chen, Yu C
@ 2025-07-09 21:31           ` Libo Chen
  0 siblings, 0 replies; 68+ messages in thread
From: Libo Chen @ 2025-07-09 21:31 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Abel Wu, Madadi Vineeth Reddy, Hillf Danton, Len Brown,
	linux-kernel, Tim Chen, Peter Zijlstra, Ingo Molnar,
	K Prateek Nayak, Gautham R . Shenoy



On 7/9/25 07:41, Chen, Yu C wrote:
> On 7/9/2025 1:22 AM, Libo Chen wrote:
>>
>>
>> On 7/8/25 01:29, Chen, Yu C wrote:
>>> On 7/8/2025 8:41 AM, Libo Chen wrote:
>>>> Hi Tim and Chenyu,
>>>>
>>>>
>>>> On 6/18/25 11:27, Tim Chen wrote:
>>>>> Cache-aware scheduling is designed to aggregate threads into their
>>>>> preferred LLC, either via the task wake up path or the load balancing
>>>>> path. One side effect is that when the preferred LLC is saturated,
>>>>> more threads will continue to be stacked on it, degrading the workload's
>>>>> latency. A strategy is needed to prevent this aggregation from going too
>>>>> far such that the preferred LLC is too overloaded.
>>>>>
>>>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>>>> migration policy:
>>>>>
>>>>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>>>>      are not too busy (<50% utilization, tunable), or the preferred
>>>>>      LLC will not be too out of balanced from the non preferred LLC
>>>>>      (>20% utilization, tunable, close to imbalance_pct of the LLC
>>>>>      domain).
>>>>> 2) Allow a task to be moved from the preferred LLC to the
>>>>>      non-preferred one if the non-preferred LLC will not be too out
>>>>>      of balanced from the preferred prompting an aggregation task
>>>>>      migration later.  We are still experimenting with the aggregation
>>>>>      and migration policy. Some other possibilities are policy based
>>>>>      on LLC's load or average number of tasks running.  Those could
>>>>>      be tried out by tweaking _get_migrate_hint().
>>>>>
>>>>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>>>>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>>>>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>>>>> +
>>>>
>>>>
>>>> I think this patch has a great potential.
>>>>
>>>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>>>> per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
>>>> preferences for llc stacking, they can all be running in the same system at the
>>>> same time. This way you can offer a greater deal of optimization without much
>>>> burden to others.
>>>
>>> Yes, this doable. It can be evaluated after the global generic strategy
>>> has been verified to work, like NUMA balancing :)
>>>
>>
>> I will run some real-world workloads and get back to you (may take some time)
>>
> 
> Thanks. It seems that there are pros and cons for different
> workloads and we are evaluating adding the RSS/active nr_running
> per process to deal with different type of workloads.
> 
>>>>
>>>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
>>>
>>> Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?
>>>
>>
>> Ah I was thinking sysctl_llc_aggr_imb alone can help reduce overstacking on
>> target LLC from a few hyperactive wakees (may consider to ratelimit those
>> wakees as a solution), but just realize this can affect lb as well and doesn't
>> really reduce overheads from frequent wakeups (no good idea on top of my head
>> but we should find a better solution than sched_feat to address the overhead issue).
>>
btw just for correction, I meant wakers here not wakees 
>>
>>
>>>> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
>>>>
>>>
>>> My understanding is that, if sysctl_llc_aggr_imb is 0, the task aggregation
>>> might still consider other aspects, like if that target LLC's utilization has
>>> exceeded 50% or not.
>>>
>>
>> which can be controlled by sysctl_llc_aggr_cap, right? Okay so if both LLCs have
>> <$(sysctl_llc_aggr_cap)% utilization, should sysctl_llc_aggr_cap be the only
>> determining factor here barring NUMA balancing?
>>
> 
> If both LLC are under (sysctl_llc_aggr_cap)%, then the strategy is still to allow
> task to be aggregated to its preferred LLC, by either asking the task to not be
> pulled out of its preferred LLC, or migrate task to its preferred LLC,
> in _get_migrate_hint().
> 
Ok, got it. It looks to me sysctl_llc_aggr_imb and sysctl_llc_aggr_cap can have quite
an impact on perf. I will play around with different values a bit.

Libo


> Thanks,
> Chenyu


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC patch v3 00/20] Cache aware scheduling
  2025-07-09 19:39 ` Madadi Vineeth Reddy
@ 2025-07-10  3:33   ` Chen, Yu C
  0 siblings, 0 replies; 68+ messages in thread
From: Chen, Yu C @ 2025-07-10  3:33 UTC (permalink / raw)
  To: Madadi Vineeth Reddy, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, Len Brown, linux-kernel,
	vernhao

On 7/10/2025 3:39 AM, Madadi Vineeth Reddy wrote:
> On 18/06/25 23:57, Tim Chen wrote:
>> This is the third revision of the cache aware scheduling patches,
>> based on the original patch proposed by Peter[1].
>>   
>> The goal of the patch series is to aggregate tasks sharing data
>> to the same cache domain, thereby reducing cache bouncing and
>> cache misses, and improve data access efficiency. In the current
>> implementation, threads within the same process are considered
>> as entities that potentially share resources.
> 
> [..snip..]
> 
>>
>> Comments and tests are much appreciated.
> 
> When running ebizzy as below:
> ebizzy -t 8 -S 10
> 
> I see ~24% degradation on the patched kernel, due to higher SMT2 and
> SMT4 cycles compared to the baseline. ST cycles decreased.
> 
> Since both P10 and P11 have LLC shared at the SMT4 level, even spawning
> fewer threads easily crowds the LLC with the default llc_aggr_cap value
> of 50. Increasing this value would likely make things worse, while
> decreasing it to 25 effectively disables cache-aware scheduling
> (as it limits selection to just one CPU).
> 
> I understand that ebizzy itself doesn't benefit from cache sharing, so
> it might not improve but here it actually *regresses*, and the impact
> may be even larger on P10 /P11 because of its smaller LLC shared by 4
> CPUs, even with fewer threads. IPC drops.
> 
> By default, the SCHED_CACHE feature is enabled. Given these results for
> workloads that don't share cache and on systems with smaller LLCs, I think
> the default value should be revisited.
> 

Thanks for the test. I agree with you. The SMT number,
the L3 cache size, the workload's working set size should
all be considered to find a proper threshold to enable/disable
task aggregation.

thanks,
Chenyu

> Thanks,
> Madadi Vineeth Reddy
> 


^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2025-07-10  3:34 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-18 18:27 [RFC patch v3 00/20] Cache aware scheduling Tim Chen
2025-06-18 18:27 ` [RFC patch v3 01/20] sched: Cache aware load-balancing Tim Chen
2025-06-26 12:23   ` Jianyong Wu
2025-06-26 13:32     ` Chen, Yu C
2025-06-27  0:10       ` Tim Chen
2025-06-27  2:13         ` Jianyong Wu
2025-07-03 19:29   ` Shrikanth Hegde
2025-07-04  8:40     ` Chen, Yu C
2025-07-04  8:45       ` Peter Zijlstra
2025-07-04  8:54         ` Shrikanth Hegde
2025-07-07 19:57     ` Tim Chen
2025-06-18 18:27 ` [RFC patch v3 02/20] sched: Several fixes for cache aware scheduling Tim Chen
2025-07-03 19:33   ` Shrikanth Hegde
2025-07-07 21:02     ` Tim Chen
2025-07-08  1:15   ` Libo Chen
2025-07-08  7:54     ` Chen, Yu C
2025-07-08 15:47       ` Libo Chen
2025-06-18 18:27 ` [RFC patch v3 03/20] sched: Avoid task migration within its preferred LLC Tim Chen
2025-06-18 18:27 ` [RFC patch v3 04/20] sched: Avoid calculating the cpumask if the system is overloaded Tim Chen
2025-07-03 19:39   ` Shrikanth Hegde
2025-07-07 14:57     ` Tim Chen
2025-06-18 18:27 ` [RFC patch v3 05/20] sched: Add hysteresis to switch a task's preferred LLC Tim Chen
2025-07-02  6:47   ` Madadi Vineeth Reddy
2025-07-02 21:47     ` Tim Chen
2025-06-18 18:27 ` [RFC patch v3 06/20] sched: Save the per LLC utilization for better cache aware scheduling Tim Chen
2025-06-18 18:27 ` [RFC patch v3 07/20] sched: Add helper function to decide whether to allow " Tim Chen
2025-07-08  0:41   ` Libo Chen
2025-07-08  8:29     ` Chen, Yu C
2025-07-08 17:22       ` Libo Chen
2025-07-09 14:41         ` Chen, Yu C
2025-07-09 21:31           ` Libo Chen
2025-07-08 21:59     ` Tim Chen
2025-07-09 21:22       ` Libo Chen
2025-06-18 18:27 ` [RFC patch v3 08/20] sched: Set up LLC indexing Tim Chen
2025-07-03 19:44   ` Shrikanth Hegde
2025-07-04  9:36     ` Chen, Yu C
2025-06-18 18:27 ` [RFC patch v3 09/20] sched: Introduce task preferred LLC field Tim Chen
2025-06-18 18:27 ` [RFC patch v3 10/20] sched: Calculate the number of tasks that have LLC preference on a runqueue Tim Chen
2025-07-03 19:45   ` Shrikanth Hegde
2025-07-04 15:00     ` Chen, Yu C
2025-06-18 18:27 ` [RFC patch v3 11/20] sched: Introduce per runqueue task LLC preference counter Tim Chen
2025-06-18 18:28 ` [RFC patch v3 12/20] sched: Calculate the total number of preferred LLC tasks during load balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 13/20] sched: Tag the sched group as llc_balance if it has tasks prefer other LLC Tim Chen
2025-06-18 18:28 ` [RFC patch v3 14/20] sched: Introduce update_llc_busiest() to deal with groups having preferred LLC tasks Tim Chen
2025-07-03 19:52   ` Shrikanth Hegde
2025-07-05  2:26     ` Chen, Yu C
2025-06-18 18:28 ` [RFC patch v3 15/20] sched: Introduce a new migration_type to track the preferred LLC load balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 16/20] sched: Consider LLC locality for active balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 17/20] sched: Consider LLC preference when picking tasks from busiest queue Tim Chen
2025-06-18 18:28 ` [RFC patch v3 18/20] sched: Do not migrate task if it is moving out of its preferred LLC Tim Chen
2025-06-18 18:28 ` [RFC patch v3 19/20] sched: Introduce SCHED_CACHE_LB to control cache aware load balance Tim Chen
2025-06-18 18:28 ` [RFC patch v3 20/20] sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake up Tim Chen
2025-06-19  6:39 ` [RFC patch v3 00/20] Cache aware scheduling Yangyu Chen
2025-06-19 13:21   ` Chen, Yu C
2025-06-19 14:12     ` Yangyu Chen
2025-06-20 19:25 ` Madadi Vineeth Reddy
2025-06-22  0:39   ` Chen, Yu C
2025-06-24 17:47     ` Madadi Vineeth Reddy
2025-06-23 16:45   ` Tim Chen
2025-06-24  5:00 ` K Prateek Nayak
2025-06-24 12:16   ` Chen, Yu C
2025-06-25  4:19     ` K Prateek Nayak
2025-06-25  0:30   ` Tim Chen
2025-06-25  4:30     ` K Prateek Nayak
2025-07-03 20:00   ` Shrikanth Hegde
2025-07-04 10:09     ` Chen, Yu C
2025-07-09 19:39 ` Madadi Vineeth Reddy
2025-07-10  3:33   ` Chen, Yu C

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).