* [PATCH v2 00/23] Cache aware scheduling
@ 2025-12-03 23:07 Tim Chen
  2025-12-03 23:07 ` [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
                   ` (23 more replies)
  0 siblings, 24 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

This patch series introduces infrastructure for cache-aware load
balancing, with the goal of co-locating tasks that share data on the
same Last Level Cache (LLC) domain. By improving cache locality, the
scheduler can reduce cache bouncing and cache misses, ultimately
improving data access efficiency. The design builds on the initial
prototype from Peter [1].
 
In this initial implementation, threads within the same process are
treated as entities that are likely to share data. During load
balancing, the scheduler attempts to aggregate these threads onto the
same LLC domain whenever possible.
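
As a rough sketch of the grouping criterion (hypothetical helper; the
series itself keys off the shared mm_struct, as the patches below show):

	/* Threads of the same process share one mm_struct. */
	static inline bool tasks_share_cache_group(struct task_struct *a,
						   struct task_struct *b)
	{
		return a->mm && a->mm == b->mm;
	}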
 
We would like to thank everyone who provided feedback on the v1
series [1]. Most of the comments have been addressed in this revision.
Several broader suggestions surfaced during review, and we believe
they are best approached in follow-up work once the foundational
cache-aware scheduling infrastructure is merged:
 
1. **Generalizing task grouping beyond processes.**
   While v2 focuses on grouping threads within a single process, other
   classes of workloads naturally share data and could benefit from LLC
   co-location, such as:
   a) Tasks from different processes that operate on shared data.
   b) Tasks belonging to the same NUMA group.
   c) Tasks with strong waker/wakee relationships.
   d) User-defined groups via cgroups or other user interfaces.
 
2. **Configurable cache-aware scheduling policies.**
   The current iteration implements a global cache-aware scheduling
   policy. Future work may introduce per-process or per-task-group
   policies, exposed through prctl() or other mechanisms.
 
**v2 Changes:**
1. Align NUMA balancing and cache affinity by
   prioritizing NUMA balancing when their decisions differ.
2. Dynamically resize per-LLC statistics structures based on the LLC
   size.
3. Switch to a contiguous LLC-ID space so these IDs can be used
   directly as array indices for LLC statistics.
4. Add clarification comments.
5. Add 3 debug patches (not meant for merging).
6. Other changes to address feedback from the review of the v1 patch
   set (see individual patch change logs).
 
Test results:

The patch series was applied and tested on v6.18-rc7.
See: https://github.com/timcchen1298/linux/commits/cache_aware_v2

The first test platform is a 2-socket Intel Sapphire Rapids system with
30 cores per socket. DRAM interleaving is enabled in the BIOS, so it
essentially has one NUMA node with two last-level caches. There are 60
CPUs associated with each last-level cache.

The second test platform is an AMD Genoa system. There are 4 NUMA nodes
and 32 CPUs per node. Each node has 2 CCXs, and each CCX has 16 CPUs.

hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
these two platforms.

[TL;DR]
Sapphire Rapids:
hackbench shows significant improvement when the number of
different active threads is below the capacity of an LLC.
schbench shows overall wakeup latency improvement.
ChaCha20-xiangshan shows good throughput improvement.

Genoa:
ChaCha20-xiangshan shows huge throughput improvement.
No obvious difference is observed in hackbench/schbench/
netperf/stream/stress-ng.
Phoronix tested v1 and reported good improvements
in 33 cases [2].

Details:
Due to length constraints, only part of the data is presented.

Sapphire Rapids:

hackbench thread pipes
                           baseline            sched_cache
       groups
Amean     1      38.8224 (   0.00%)     26.4582 *  31.85%*
Amean     3      38.2358 (   0.00%)     38.0758 (   0.42%)
Amean     5      40.7282 (   0.00%)     41.1568 (  -1.05%)
Amean     7      51.1720 (   0.00%)     50.6646 (   0.99%)
Amean     12     63.1562 (   0.00%)     63.3516 (  -0.31%)
Amean     16     73.9584 (   0.00%)     75.5596 (  -2.17%)
Max       1      39.4140 (   0.00%)     26.7590 (  32.11%)
Max       3      40.8310 (   0.00%)     39.8000 (   2.53%)
Max       5      42.2150 (   0.00%)     42.4860 (  -0.64%)
Max       7      52.1800 (   0.00%)     51.9370 (   0.47%)
Max       12     63.9430 (   0.00%)     64.2820 (  -0.53%)
Max       16     74.3710 (   0.00%)     76.4170 (  -2.75%)

Further hackbench tests using other numbers of fds:

case         fd          groups         baseline(std%)  compare%( std%)
threads-pipe-2          1-groups         1.00 (  1.25)  +38.52 (  1.33)
threads-pipe-2          2-groups         1.00 ( 12.52)  +12.74 (  1.31)
threads-pipe-2          4-groups         1.00 (  7.91)  +12.29 (  1.86)
threads-pipe-4          1-groups         1.00 (  0.55)  +34.99 (  0.45)
threads-pipe-4          2-groups         1.00 ( 16.00)  +27.32 (  0.75)
threads-pipe-4          4-groups         1.00 ( 17.37)  +25.75 (  0.20)
threads-pipe-8          1-groups         1.00 (  0.74)  +27.13 (  0.44)
threads-pipe-8          2-groups         1.00 (  8.82)  +23.79 (  0.32)
threads-pipe-8          4-groups         1.00 (  1.30)  +27.64 (  0.51)
threads-pipe-16         1-groups         1.00 (  1.03)  +30.55 (  0.27)
threads-pipe-16         2-groups         1.00 (  6.43)  +29.52 (  0.20)
threads-pipe-16         4-groups         1.00 (  1.36)   -1.85 (  1.43)
threads-pipe-20         1-groups         1.00 (  0.45)  +30.88 (  0.42)
threads-pipe-20         2-groups         1.00 (  1.95)   -0.81 (  5.84)
threads-pipe-20         4-groups         1.00 (  2.09)   -1.77 (  7.57)

stream:
                              baseline            sched_cache
GB/sec copy-2        36.48 (   0.00%)       36.55 (   0.18%)
GB/sec scale-2       36.83 (   0.00%)       36.97 (   0.38%)
GB/sec add-2         37.92 (   0.00%)       38.03 (   0.31%)
GB/sec triad-2       37.83 (   0.00%)       37.97 (   0.37%)

stress-ng context switch:
                                    baseline            sched_cache
Min       context-1       2957.81 (   0.00%)     2966.17 (   0.28%)
Min       context-2       5931.68 (   0.00%)     5930.17 (  -0.03%)
Min       context-4      11874.20 (   0.00%)    11875.68 (   0.01%)
Min       context-8      23755.30 (   0.00%)    23762.43 (   0.03%)
Min       context-16     47535.14 (   0.00%)    47526.46 (  -0.02%)
Min       context-32     95078.66 (   0.00%)    94356.39 (  -0.76%)
Min       context-64    190074.62 (   0.00%)   190042.93 (  -0.02%)
Min       context-128   371107.12 (   0.00%)   371008.10 (  -0.03%)
Min       context-256   578443.73 (   0.00%)   579037.86 (   0.10%)
Min       context-480   580203.34 (   0.00%)   580499.43 (   0.05%)
Hmean     context-1       2964.59 (   0.00%)     2967.69 (   0.10%)
Hmean     context-2       5936.41 (   0.00%)     5935.51 (  -0.02%)
Hmean     context-4      11879.56 (   0.00%)    11881.70 (   0.02%)
Hmean     context-8      23771.92 (   0.00%)    23770.28 (  -0.01%)
Hmean     context-16     47552.23 (   0.00%)    47538.01 (  -0.03%)
Hmean     context-32     95102.67 (   0.00%)    94969.43 (  -0.14%)
Hmean     context-64    190129.74 (   0.00%)   190088.68 (  -0.02%)
Hmean     context-128   371291.95 (   0.00%)   371114.82 (  -0.05%)
Hmean     context-256   578907.96 (   0.00%)   579338.99 (   0.07%)
Hmean     context-480   580541.78 (   0.00%)   580726.13 (   0.03%)
Max       context-1       2967.93 (   0.00%)     2968.90 (   0.03%)
Max       context-2       5942.37 (   0.00%)     5940.40 (  -0.03%)
Max       context-4      11885.25 (   0.00%)    11886.43 (   0.01%)
Max       context-8      23784.17 (   0.00%)    23783.31 (  -0.00%)
Max       context-16     47576.84 (   0.00%)    47561.42 (  -0.03%)
Max       context-32     95139.03 (   0.00%)    95094.86 (  -0.05%)
Max       context-64    190180.08 (   0.00%)   190123.31 (  -0.03%)
Max       context-128   371451.73 (   0.00%)   371240.25 (  -0.06%)
Max       context-256   579355.24 (   0.00%)   579731.37 (   0.06%)
Max       context-480   580750.44 (   0.00%)   581118.33 (   0.06%)
BHmean-50 context-1       2966.80 (   0.00%)     2968.82 (   0.07%)
BHmean-50 context-2       5939.32 (   0.00%)     5939.49 (   0.00%)
BHmean-50 context-4      11883.02 (   0.00%)    11886.08 (   0.03%)
BHmean-50 context-8      23778.40 (   0.00%)    23775.90 (  -0.01%)
BHmean-50 context-16     47568.31 (   0.00%)    47546.19 (  -0.05%)
BHmean-50 context-32     95125.84 (   0.00%)    95087.06 (  -0.04%)
BHmean-50 context-64    190165.37 (   0.00%)   190117.94 (  -0.02%)
BHmean-50 context-128   371405.28 (   0.00%)   371168.75 (  -0.06%)
BHmean-50 context-256   579137.11 (   0.00%)   579609.35 (   0.08%)
BHmean-50 context-480   580646.72 (   0.00%)   580920.46 (   0.05%)
BHmean-95 context-1       2965.72 (   0.00%)     2967.94 (   0.07%)
BHmean-95 context-2       5937.20 (   0.00%)     5936.40 (  -0.01%)
BHmean-95 context-4      11880.45 (   0.00%)    11882.71 (   0.02%)
BHmean-95 context-8      23774.69 (   0.00%)    23771.59 (  -0.01%)
BHmean-95 context-16     47555.08 (   0.00%)    47539.93 (  -0.03%)
BHmean-95 context-32     95106.67 (   0.00%)    95072.38 (  -0.04%)
BHmean-95 context-64    190138.93 (   0.00%)   190096.30 (  -0.02%)
BHmean-95 context-128   371322.78 (   0.00%)   371132.61 (  -0.05%)
BHmean-95 context-256   578985.41 (   0.00%)   579389.21 (   0.07%)
BHmean-95 context-480   580598.22 (   0.00%)   580763.93 (   0.03%)
BHmean-99 context-1       2965.72 (   0.00%)     2967.94 (   0.07%)
BHmean-99 context-2       5937.20 (   0.00%)     5936.40 (  -0.01%)
BHmean-99 context-4      11880.45 (   0.00%)    11882.71 (   0.02%)
BHmean-99 context-8      23774.69 (   0.00%)    23771.59 (  -0.01%)
BHmean-99 context-16     47555.08 (   0.00%)    47539.93 (  -0.03%)
BHmean-99 context-32     95106.67 (   0.00%)    95072.38 (  -0.04%)
BHmean-99 context-64    190138.93 (   0.00%)   190096.30 (  -0.02%)
BHmean-99 context-128   371322.78 (   0.00%)   371132.61 (  -0.05%)
BHmean-99 context-256   578985.41 (   0.00%)   579389.21 (   0.07%)
BHmean-99 context-480   580598.22 (   0.00%)   580763.93 (   0.03%)

schbench thread = 1
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        10.71(0.76)          9.86(1.46)           +7.94%    
Request Latencies 99.0th       4036.00(6.53)        4054.29(10.03)       -0.45%    
RPS 50.0th                     267.29(0.49)         266.86(0.38)         -0.16%    
Average RPS                    268.42(0.16)         267.86(0.31)         -0.21%    

schbench thread = 2
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        11.43(1.13)          8.00(2.00)           +30.01%   
Request Latencies 99.0th       4007.43(34.52)       3967.43(70.03)       +1.00%    
RPS 50.0th                     536.71(0.76)         536.14(1.57)         -0.11%    
Average RPS                    536.59(0.55)         535.33(1.34)         -0.23%    

schbench thread = 4
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        9.57(0.79)           6.14(1.46)           +35.84%   
Request Latencies 99.0th       3789.14(31.47)       3810.86(48.97)       -0.57%    
RPS 50.0th                     1074.00(0.00)        1073.43(2.76)        -0.05%    
Average RPS                    1075.03(1.07)        1072.93(2.13)        -0.20%    

schbench thread = 8
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        9.29(0.49)           6.57(1.81)           +29.28%   
Request Latencies 99.0th       3756.00(19.60)       3769.71(23.87)       -0.37%    
RPS 50.0th                     2152.57(4.28)        2152.57(4.28)        0.00%     
Average RPS                    2151.07(2.71)        2150.58(3.41)        -0.02%    

schbench thread = 16
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        9.43(0.53)           6.86(0.90)           +27.25%   
Request Latencies 99.0th       3780.00(32.98)       3774.29(11.04)       +0.15%    
RPS 50.0th                     4305.14(8.55)        4307.43(7.81)        +0.05%    
Average RPS                    4303.47(5.74)        4301.71(4.35)        -0.04%    

schbench thread = 32
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        10.14(0.38)          6.86(0.69)           +32.35%   
Request Latencies 99.0th       3764.00(21.66)       3806.29(32.24)       -1.12%    
RPS 50.0th                     8624.00(0.00)        8619.43(12.09)       -0.05%    
Average RPS                    8607.36(5.29)        8602.69(7.08)        -0.05%    

schbench thread = 64
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        11.71(0.49)          8.43(1.81)           +28.01%   
Request Latencies 99.0th       3796.00(62.48)       3860.25(147.35)      -1.69%  
RPS 50.0th                     17238.86(24.19)      16411.43(88.95)      -4.80%    
Average RPS                    17209.02(10.18)      16389.73(100.27)     -4.76%    

schbench thread = 128
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        13.29(0.49)          12.00(0.00)          +9.71%    
Request Latencies 99.0th       7893.71(11.04)       7909.71(17.10)       -0.20%    
RPS 50.0th                     32013.71(194.52)     32068.57(50.35)      +0.17%    
Average RPS                    31762.03(238.18)     31884.81(300.85)     +0.39%    

schbench thread = 239
Metric                         Base (mean±std)      Compare (mean±std)   Change    
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th        13.29(0.49)          14.43(0.53)          -8.58%    
Request Latencies 99.0th       8174.86(8.55)        8244.57(12.09)       -0.85%    
RPS 50.0th                     30624.00(0.00)       30614.86(24.19)      -0.03%    
Average RPS                    30695.86(11.03)      30673.35(17.31)      -0.07%    

chacha20:
baseline:
Host time spent: 66,320ms
sched_cache:
Host time spent: 53,859ms
Time reduced by 18%, throughput increased by 23%

Genoa:
chacha20
baseline:
Host time spent: 51,848ms
sched_cache:
Host time spent: 28,439ms

Time reduced by 45%, throughput increased by 82%

[1] https://lore.kernel.org/all/cover.1760206683.git.tim.c.chen@linux.intel.com/
[2] https://www.phoronix.com/review/cache-aware-scheduling-amd-turin

Chen Yu (10):
  sched/cache: Record per-LLC utilization to guide cache-aware
    scheduling decisions
  sched/cache: Introduce helper functions to enforce LLC migration
    policy
  sched/cache: Introduce sched_cache_present to enable cache aware
    scheduling for multi LLCs NUMA node
  sched/cache: Record the number of active threads per process for
    cache-aware scheduling
  sched/cache: Disable cache aware scheduling for processes with high
    thread counts
  sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  sched/cache: Add user control to adjust the parameters of cache-aware
    scheduling
  -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware
    load balancing
  -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load
    balance statistics
  -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy
    for each process via proc fs

Peter Zijlstra (Intel) (1):
  sched/cache: Introduce infrastructure for cache-aware load balancing

Tim Chen (12):
  sched/cache: Make LLC id continuous
  sched/cache: Assign preferred LLC ID to processes
  sched/cache: Track LLC-preferred tasks per runqueue
  sched/cache: Introduce per runqueue task LLC preference counter
  sched/cache: Calculate the per runqueue task LLC preference
  sched/cache: Count tasks prefering destination LLC in a sched group
  sched/cache: Check local_group only once in update_sg_lb_stats()
  sched/cache: Prioritize tasks preferring destination LLC during
    balancing
  sched/cache: Add migrate_llc_task migration type for cache-aware
    balancing
  sched/cache: Handle moving single tasks to/from their preferred LLC
  sched/cache: Consider LLC preference when selecting tasks for load
    balancing
  sched/cache: Respect LLC preference in task migration and detach

 fs/proc/base.c                 |   22 +
 include/linux/cacheinfo.h      |   21 +-
 include/linux/mm_types.h       |   60 ++
 include/linux/sched.h          |   19 +
 include/linux/sched/topology.h |    5 +
 include/trace/events/sched.h   |   31 +
 init/Kconfig                   |   11 +
 init/init_task.c               |    4 +
 kernel/fork.c                  |    6 +
 kernel/sched/core.c            |   12 +
 kernel/sched/debug.c           |   62 ++
 kernel/sched/fair.c            | 1034 +++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |   39 ++
 kernel/sched/stats.c           |    5 +-
 kernel/sched/topology.c        |  239 +++++++-
 15 files changed, 1543 insertions(+), 27 deletions(-)

-- 
2.32.0



* [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-09 11:12   ` Peter Zijlstra
                     ` (3 more replies)
  2025-12-03 23:07 ` [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions Tim Chen
                   ` (22 subsequent siblings)
  23 siblings, 4 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen,
	linux-kernel

From: "Peter Zijlstra (Intel)" <peterz@infradead.org>

Add infrastructure to enable cache-aware load balancing,
which improves cache locality by grouping tasks that share resources
within the same cache domain. This reduces cache misses and improves
overall data access efficiency.

In this initial implementation, threads belonging to the same process
are treated as entities that likely share working sets. The mechanism
tracks per-process CPU occupancy across cache domains and attempts to
migrate threads toward cache-hot domains where their process already
has active threads, thereby enhancing locality.

This provides a basic model for cache affinity. While the current code
targets the last-level cache (LLC), the approach could be extended to
other domain types such as clusters (L2) or node-internal groupings.

At present, the mechanism selects the CPU within an LLC that has the
highest recent runtime. Subsequent patches in this series will use this
information in the load-balancing path to guide task placement toward
preferred LLCs.

In the future, more advanced policies could be integrated through NUMA
balancing: for example, migrating a task to its preferred LLC when spare
capacity exists, or swapping tasks across LLCs to improve cache affinity.
The grouping of tasks could also be generalized from a process to a
NUMA group, or be made user configurable.
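
As a rough illustration of the bookkeeping below (identifiers are from
this patch; the arithmetic is only an example): every EPOCH_PERIOD
(10 ms) the per-CPU and per-process runtime counters are halved, so the
accumulated runtime forms a geometric series (r = 0.5) that sums to at
most about two epochs' worth of runtime. The occupancy of a process on
CPU i is then roughly

	occ(i) = NICE_0_LOAD * pcpu_sched->runtime / (rq->cpu_runtime + 1)

i.e. the process's share of that CPU's recently decayed runtime, and
task_cache_work() prefers the LLC whose CPUs have the largest sum of
occ(i).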

Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2:
       Restore the original CPU scan to cover all online CPUs,
       rather than scanning within the preferred NUMA node.
       (Peter Zijlstra)
    
       Use rq->curr instead of rq->donor. (K Prateek Nayak)
    
       Minor fix in task_tick_cache() to use
       if (mm->mm_sched_epoch >= rq->cpu_epoch)
       to avoid mm_sched_epoch going backwards.

 include/linux/mm_types.h |  44 +++++++
 include/linux/sched.h    |  11 ++
 init/Kconfig             |  11 ++
 kernel/fork.c            |   6 +
 kernel/sched/core.c      |   6 +
 kernel/sched/fair.c      | 258 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h     |   8 ++
 7 files changed, 344 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 90e5790c318f..1ea16ef90566 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -939,6 +939,11 @@ typedef struct {
 	DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
 } __private mm_flags_t;
 
+struct mm_sched {
+	u64 runtime;
+	unsigned long epoch;
+};
+
 struct kioctx_table;
 struct iommu_mm_data;
 struct mm_struct {
@@ -1029,6 +1034,17 @@ struct mm_struct {
 		 */
 		raw_spinlock_t cpus_allowed_lock;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		/*
+		 * Track per-cpu-per-process occupancy as a proxy for cache residency.
+		 * See account_mm_sched() and ...
+		 */
+		struct mm_sched __percpu *pcpu_sched;
+		raw_spinlock_t mm_sched_lock;
+		unsigned long mm_sched_epoch;
+		int mm_sched_cpu;
+#endif
+
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
 #endif
@@ -1487,6 +1503,34 @@ static inline unsigned int mm_cid_size(void)
 static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
 #endif /* CONFIG_SCHED_MM_CID */
 
+#ifdef CONFIG_SCHED_CACHE
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
+
+static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
+{
+	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+
+	if (!pcpu_sched)
+		return -ENOMEM;
+
+	mm_init_sched(mm, pcpu_sched);
+	return 0;
+}
+
+#define mm_alloc_sched(...)	alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
+
+static inline void mm_destroy_sched(struct mm_struct *mm)
+{
+	free_percpu(mm->pcpu_sched);
+	mm->pcpu_sched = NULL;
+}
+#else /* !CONFIG_SCHED_CACHE */
+
+static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_sched(struct mm_struct *mm) { }
+
+#endif /* CONFIG_SCHED_CACHE */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b469878de25c..278b529c91df 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1406,6 +1406,10 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_CACHE
+	struct callback_head		cache_work;
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_len;
@@ -2428,4 +2432,11 @@ extern void migrate_enable(void);
 
 DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
 
+#ifdef CONFIG_SCHED_CACHE
+static inline bool sched_cache_enabled(void)
+{
+	return false;
+}
+#endif
+
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index cab3ad28ca49..88556ef8cfd1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -983,6 +983,17 @@ config NUMA_BALANCING
 
 	  This system will be inactive on UMA systems.
 
+config SCHED_CACHE
+	bool "Cache aware load balance"
+	default y
+	depends on SMP
+	help
+	  When enabled, the scheduler will attempt to aggregate tasks from
+	  the same process onto a single Last Level Cache (LLC) domain when
+	  possible. This improves cache locality by keeping tasks that share
+	  resources within the same cache domain, reducing cache misses and
+	  lowering data access latency.
+
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
diff --git a/kernel/fork.c b/kernel/fork.c
index 3da0f08615a9..aae5053d1e30 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -680,6 +680,7 @@ void __mmdrop(struct mm_struct *mm)
 	cleanup_lazy_tlbs(mm);
 
 	WARN_ON_ONCE(mm == current->active_mm);
+	mm_destroy_sched(mm);
 	mm_free_pgd(mm);
 	mm_free_id(mm);
 	destroy_context(mm);
@@ -1083,6 +1084,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
+	if (mm_alloc_sched(mm))
+		goto fail_sched;
+
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
 				     NR_MM_COUNTERS))
 		goto fail_pcpu;
@@ -1092,6 +1096,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	return mm;
 
 fail_pcpu:
+	mm_destroy_sched(mm);
+fail_sched:
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f754a60de848..e8bdf03a4b7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4488,6 +4488,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 	p->wake_entry.u_flags = CSD_TYPE_TTWU;
 	p->migration_pending = NULL;
 	init_sched_mm_cid(p);
+	init_sched_mm(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -8791,6 +8792,11 @@ void __init sched_init(void)
 
 		rq->core_cookie = 0UL;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		raw_spin_lock_init(&rq->cpu_epoch_lock);
+		rq->cpu_epoch_next = jiffies;
+#endif
+
 		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..cb82f558dc5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1152,6 +1152,8 @@ void post_init_entity_util_avg(struct task_struct *p)
 	sa->runnable_avg = sa->util_avg;
 }
 
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec);
+
 static s64 update_se(struct rq *rq, struct sched_entity *se)
 {
 	u64 now = rq_clock_task(rq);
@@ -1174,6 +1176,7 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 
 		trace_sched_stat_runtime(running, delta_exec);
 		account_group_exec_runtime(running, delta_exec);
+		account_mm_sched(rq, running, delta_exec);
 
 		/* cgroup time is always accounted against the donor */
 		cgroup_account_cputime(donor, delta_exec);
@@ -1193,6 +1196,259 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 	return delta_exec;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+
+/*
+ * XXX numbers come from a place the sun don't shine -- probably wants to be SD
+ * tunable or so.
+ */
+#define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
+#define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */
+
+static int llc_id(int cpu)
+{
+	if (cpu < 0)
+		return -1;
+
+	return per_cpu(sd_llc_id, cpu);
+}
+
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
+{
+	unsigned long epoch;
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct rq *rq = cpu_rq(i);
+
+		pcpu_sched->runtime = 0;
+		pcpu_sched->epoch = rq->cpu_epoch;
+		epoch = rq->cpu_epoch;
+	}
+
+	raw_spin_lock_init(&mm->mm_sched_lock);
+	mm->mm_sched_epoch = epoch;
+	mm->mm_sched_cpu = -1;
+
+	/*
+	 * The update to mm->pcpu_sched should not be reordered
+	 * before initialization to mm's other fields, in case
+	 * the readers may get invalid mm_sched_epoch, etc.
+	 */
+	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
+}
+
+/* because why would C be fully specified */
+static __always_inline void __shr_u64(u64 *val, unsigned int n)
+{
+	if (n >= 64) {
+		*val = 0;
+		return;
+	}
+	*val >>= n;
+}
+
+static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	lockdep_assert_held(&rq->cpu_epoch_lock);
+
+	unsigned long n, now = jiffies;
+	long delta = now - rq->cpu_epoch_next;
+
+	if (delta > 0) {
+		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		rq->cpu_epoch += n;
+		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		__shr_u64(&rq->cpu_runtime, n);
+	}
+
+	n = rq->cpu_epoch - pcpu_sched->epoch;
+	if (n) {
+		pcpu_sched->epoch += n;
+		__shr_u64(&pcpu_sched->runtime, n);
+	}
+}
+
+static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+
+	__update_mm_sched(rq, pcpu_sched);
+
+	/*
+	 * Runtime is a geometric series (r=0.5) and as such will sum to twice
+	 * the accumulation period; this means the multiplication here should
+	 * not overflow.
+	 */
+	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
+}
+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+	struct mm_struct *mm = p->mm;
+	struct mm_sched *pcpu_sched;
+	unsigned long epoch;
+
+	if (!sched_cache_enabled())
+		return;
+
+	if (p->sched_class != &fair_sched_class)
+		return;
+	/*
+	 * init_task and kthreads don't have an mm
+	 */
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
+
+	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+		__update_mm_sched(rq, pcpu_sched);
+		pcpu_sched->runtime += delta_exec;
+		rq->cpu_runtime += delta_exec;
+		epoch = rq->cpu_epoch;
+	}
+
+	/*
+	 * If this task hasn't hit task_cache_work() for a while, or it
+	 * has only 1 thread, invalidate its preferred state.
+	 */
+	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	    get_nr_threads(p) <= 1) {
+		if (mm->mm_sched_cpu != -1)
+			mm->mm_sched_cpu = -1;
+	}
+}
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	struct mm_struct *mm = p->mm;
+
+	if (!sched_cache_enabled())
+		return;
+
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	/* avoid moving backwards */
+	if (mm->mm_sched_epoch >= rq->cpu_epoch)
+		return;
+
+	guard(raw_spinlock)(&mm->mm_sched_lock);
+
+	if (work->next == work) {
+		task_work_add(p, work, TWA_RESUME);
+		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
+	}
+}
+
+static void __no_profile task_cache_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	unsigned long m_a_occ = 0;
+	unsigned long curr_m_a_occ = 0;
+	int cpu, m_a_cpu = -1;
+	cpumask_var_t cpus;
+
+	WARN_ON_ONCE(work != &p->cache_work);
+
+	work->next = work;
+
+	if (p->flags & PF_EXITING)
+		return;
+
+	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
+		return;
+
+	scoped_guard (cpus_read_lock) {
+		cpumask_copy(cpus, cpu_online_mask);
+
+		for_each_cpu(cpu, cpus) {
+			/* XXX sched_cluster_active */
+			struct sched_domain *sd = per_cpu(sd_llc, cpu);
+			unsigned long occ, m_occ = 0, a_occ = 0;
+			int m_cpu = -1, i;
+
+			if (!sd)
+				continue;
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				occ = fraction_mm_sched(cpu_rq(i),
+							per_cpu_ptr(mm->pcpu_sched, i));
+				a_occ += occ;
+				if (occ > m_occ) {
+					m_occ = occ;
+					m_cpu = i;
+				}
+			}
+
+			/*
+			 * Compare the accumulated occupancy of each LLC. The
+			 * reason for using accumulated occupancy rather than average
+			 * per CPU occupancy is that it works better in asymmetric LLC
+			 * scenarios.
+			 * For example, if there are 2 threads in a 4-CPU LLC and 3
+			 * threads in an 8-CPU LLC, it might be better to choose the one
+			 * with 3 threads. However, this would not be the case if the
+			 * occupancy is divided by the number of CPUs in an LLC (i.e.,
+			 * if average per CPU occupancy is used).
+			 * Besides, NUMA balancing fault statistics behave similarly:
+			 * the total number of faults per node is compared rather than
+			 * the average number of faults per CPU. This strategy is also
+			 * followed here.
+			 */
+			if (a_occ > m_a_occ) {
+				m_a_occ = a_occ;
+				m_a_cpu = m_cpu;
+			}
+
+			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
+				curr_m_a_occ = a_occ;
+
+			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+		}
+	}
+
+	if (m_a_occ > (2 * curr_m_a_occ)) {
+		/*
+		 * Avoid switching mm_sched_cpu too fast.
+		 * The reason to choose 2X is because:
+		 * 1. It is better to keep the preferred LLC stable,
+		 *    rather than changing it frequently and causing migrations
+		 * 2. 2X means the new preferred LLC has at least 1 more
+		 *    busy CPU than the old one (e.g. 200% vs 100%)
+		 * 3. 2X is chosen based on test results, as it delivers
+		 *    the optimal performance gain so far.
+		 */
+		mm->mm_sched_cpu = m_a_cpu;
+	}
+
+	free_cpumask_var(cpus);
+}
+
+void init_sched_mm(struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+
+	init_task_work(work, task_cache_work);
+	work->next = work;
+}
+
+#else
+
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
+				    s64 delta_exec) { }
+
+void init_sched_mm(struct task_struct *p) { }
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
+
+#endif
+
 /*
  * Used by other classes to account runtime.
  */
@@ -13124,6 +13380,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	task_tick_cache(rq, curr);
+
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3409d7..84118b522f22 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1194,6 +1194,12 @@ struct rq {
 	u64			clock_pelt_idle_copy;
 	u64			clock_idle_copy;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	raw_spinlock_t		cpu_epoch_lock ____cacheline_aligned;
+	u64			cpu_runtime;
+	unsigned long		cpu_epoch;
+	unsigned long		cpu_epoch_next;
+#endif
 
 	atomic_t		nr_iowait;
 
@@ -3819,6 +3825,8 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
 static inline void init_sched_mm_cid(struct task_struct *t) { }
 #endif /* !CONFIG_SCHED_MM_CID */
 
+extern void init_sched_mm(struct task_struct *p);
+
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 static inline
-- 
2.32.0



* [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
  2025-12-03 23:07 ` [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-09 11:21   ` Peter Zijlstra
  2025-12-03 23:07 ` [PATCH v2 03/23] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
                   ` (21 subsequent siblings)
  23 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

When a system becomes busy and a process’s preferred LLC is
saturated with too many threads, tasks within that LLC migrate
frequently. These intra-LLC migrations introduce latency and degrade
performance. To avoid this, task aggregation should be suppressed when
the preferred LLC is overloaded, which requires a metric to indicate
LLC utilization.

Record per-LLC utilization and CPU capacity during periodic load
balancing. These statistics will be used in later patches to decide
whether tasks should be aggregated into their preferred LLC.
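
A minimal sketch of the intended consumer pattern (get_llc_stats() is
added below; the policy that uses it comes in later patches, and the
50% threshold here is only an assumed example):

	unsigned long util, cap;

	rcu_read_lock();
	if (get_llc_stats(cpu, &util, &cap) &&
	    util * 100 < cap * 50) {
		/* the LLC containing 'cpu' still has headroom for aggregation */
	}
	rcu_read_unlock();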

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2:
       Refine the comments in record_sg_llc_stats().(Peter Zijlstra).

 include/linux/sched/topology.h |  4 ++
 kernel/sched/fair.c            | 69 ++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index bbcfdf12aa6e..0ba4697d74ba 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -68,6 +68,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned long	util_avg;
+	unsigned long	capacity ____cacheline_aligned_in_smp;
+#endif
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb82f558dc5b..b9f336300f14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9622,6 +9622,29 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/* Called from load balancing paths with rcu_read_lock held */
+static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
+					 unsigned long *cap)
+{
+	struct sched_domain_shared *sd_share;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (!sd_share)
+		return false;
+
+	*util = READ_ONCE(sd_share->util_avg);
+	*cap = READ_ONCE(sd_share->capacity);
+
+	return true;
+}
+#else
+static inline bool get_llc_stats(int cpu, unsigned long *util,
+				 unsigned long *cap)
+{
+	return false;
+}
+#endif
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
  */
@@ -10592,6 +10615,51 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Record the statistics for this scheduler group for later
+ * use. These values guide load balancing on aggregating tasks
+ * to a LLC.
+ */
+static void record_sg_llc_stats(struct lb_env *env,
+				struct sg_lb_stats *sgs,
+				struct sched_group *group)
+{
+	struct sched_domain_shared *sd_share;
+
+	if (!sched_cache_enabled() || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/* Only care about sched domain spanning multiple LLCs */
+	if (env->sd->child != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
+		return;
+
+	/*
+	 * At this point we know this group spans a LLC domain.
+	 * Record the statistic of this group in its corresponding
+	 * shared LLC domain.
+	 * Note: sd_share cannot be obtained via sd->child->shared, because
+	 * it refers to the domain that covers the local group, while
+	 * sd_share could represent any of the LLC groups.
+	 */
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
+					   cpumask_first(sched_group_span(group))));
+	if (!sd_share)
+		return;
+
+	if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
+		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
+
+	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
+		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
+}
+#else
+static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
+				       struct sched_group *group)
+{
+}
+#endif
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10681,6 +10749,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
+	record_sg_llc_stats(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.32.0



* [PATCH v2 03/23] sched/cache: Introduce helper functions to enforce LLC migration policy
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
  2025-12-03 23:07 ` [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
  2025-12-03 23:07 ` [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-03 23:07 ` [PATCH v2 04/23] sched/cache: Make LLC id continuous Tim Chen
                   ` (20 subsequent siblings)
  23 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

Cache-aware scheduling aggregates threads onto their preferred LLC,
mainly through load balancing. When the preferred LLC becomes
saturated, more threads are still placed there, increasing latency.
A mechanism is needed to limit aggregation so that the preferred LLC
does not become overloaded.

Introduce helper functions can_migrate_llc() and
can_migrate_llc_task() to enforce the LLC migration policy:

  1. Aggregate a task to its preferred LLC if both source and
     destination LLCs are not too busy (<50% utilization),
     or if doing so will not leave the preferred LLC much more
     imbalanced than the non-preferred one (>20% utilization
     difference, similar to imbalance_pct of the LLC domain).
  2. Allow moving a task from an overloaded preferred LLC to a
     non-preferred LLC if this will not make the non-preferred LLC
     imbalanced enough to trigger a later migration back.
  3. If both LLCs are too busy, let the generic load balancer spread
     the tasks.

Further (hysteresis) action could be taken in the future to prevent tasks
from being migrated into and out of the preferred LLC frequently (back and
forth): the threshold for migrating a task out of its preferred LLC should
be higher than that for migrating it into the LLC.

Since aggregation tends to make the preferred LLC busier than others,
the imbalance tolerance is controlled by llc_imb_pct. If set to 0,
tasks may still aggregate to the preferred LLC as long as it is
not more utilized than the source LLC, preserving the preference.
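
A worked example of the thresholds, using the fits_llc_capacity() and
util_greater() helpers added in this patch with the defaults
llc_overload_pct = 50 and llc_imb_pct = 20 (numbers are illustrative
only):

	fits_llc_capacity(400, 1024);	/* true:  400 * 100 < 1024 * 50 */
	fits_llc_capacity(600, 1024);	/* false: 600 * 100 > 1024 * 50 */

	util_greater(500, 400);		/* true:  500 * 100 > 400 * 120 */
	util_greater(450, 400);		/* false: 450 * 100 < 400 * 120 */

So an LLC of capacity 1024 counts as busy above ~512 utilization, and
the destination only counts as noticeably busier than the source once
it exceeds it by more than 20%.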

Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2:
       No change.

 kernel/sched/fair.c  | 153 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   5 ++
 2 files changed, 158 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9f336300f14..710ed9943d27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1205,6 +1205,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 #define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
 #define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */
 
+__read_mostly unsigned int llc_overload_pct       = 50;
+__read_mostly unsigned int llc_imb_pct            = 20;
+
 static int llc_id(int cpu)
 {
 	if (cpu < 0)
@@ -9623,6 +9626,27 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
 }
 
 #ifdef CONFIG_SCHED_CACHE
+/*
+ * The margin used when comparing LLC utilization with CPU capacity.
+ * Parameter llc_overload_pct determines the LLC load level where
+ * active LLC aggregation is done.
+ * Derived from fits_capacity().
+ *
+ * (default: ~50%)
+ */
+#define fits_llc_capacity(util, max)	\
+	((util) * 100 < (max) * llc_overload_pct)
+
+/*
+ * The margin used when comparing utilization:
+ * is 'util1' noticeably greater than 'util2'?
+ * Derived from capacity_greater().
+ * Bias is in percentage.
+ */
+/* Allows dst util to be bigger than src util by up to bias percent */
+#define util_greater(util1, util2) \
+	((util1) * 100 > (util2) * (100 + llc_imb_pct))
+
 /* Called from load balancing paths with rcu_read_lock held */
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 					 unsigned long *cap)
@@ -9638,6 +9662,135 @@ static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 
 	return true;
 }
+
+/*
+ * Decision matrix according to the LLC utilization, used to
+ * decide whether we can do task aggregation across LLCs.
+ *
+ * By default, 50% is the threshold to treat the LLC as busy,
+ * and 20% is the utilization imbalance percentage to decide
+ * if the preferred LLC is busier than the non-preferred LLC.
+ * The hysteresis is used to avoid task bouncing between the
+ * preferred LLC and the non-preferred LLC.
+ *
+ * 1. moving towards the preferred LLC, dst is the preferred
+ *    LLC, src is not.
+ *
+ * src \ dst      30%  40%  50%  60%
+ * 30%            Y    Y    Y    N
+ * 40%            Y    Y    Y    Y
+ * 50%            Y    Y    G    G
+ * 60%            Y    Y    G    G
+ *
+ * 2. moving out of the preferred LLC, src is the preferred
+ *    LLC, dst is not:
+ *
+ * src \ dst      30%  40%  50%  60%
+ * 30%            N    N    N    N
+ * 40%            N    N    N    N
+ * 50%            N    N    G    G
+ * 60%            Y    N    G    G
+ *
+ * src :      src_util
+ * dst :      dst_util
+ * Y :        Yes, migrate
+ * N :        No, do not migrate
+ * G :        let the Generic load balance to even the load.
+ *
+ * The intention is that if both LLCs are quite busy, cache aware
+ * load balance should not be performed, and generic load balance
+ * should take effect. However, if one is busy and the other is not,
+ * the preferred LLC capacity (50%) and imbalance criteria (20%) should
+ * be considered to determine whether LLC aggregation should be
+ * performed to bias the load towards the preferred LLC.
+ */
+
+/* migration decision, 3 states are orthogonal. */
+enum llc_mig {
+	mig_forbid = 0,		/* N: Don't migrate task, respect LLC preference */
+	mig_llc,		/* Y: Do LLC preference based migration */
+	mig_unrestricted	/* G: Don't restrict generic load balance migration */
+};
+
+/*
+ * Check if task can be moved from the source LLC to the
+ * destination LLC without breaking cache-aware preference.
+ * src_cpu and dst_cpu are arbitrary CPUs within the source
+ * and destination LLCs, respectively.
+ */
+static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
+				    unsigned long tsk_util,
+				    bool to_pref)
+{
+	unsigned long src_util, dst_util, src_cap, dst_cap;
+
+	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
+	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
+		return mig_unrestricted;
+
+	if (!fits_llc_capacity(dst_util, dst_cap) &&
+	    !fits_llc_capacity(src_util, src_cap))
+		return mig_unrestricted;
+
+	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
+	dst_util = dst_util + tsk_util;
+	if (to_pref) {
+		/*
+		 * llc_imb_pct is the imbalance allowed between
+		 * preferred LLC and non-preferred LLC.
+		 * Don't migrate if we will get preferred LLC too
+		 * heavily loaded and if the dest is much busier
+		 * than the src, in which case migration will
+		 * increase the imbalance too much.
+		 */
+		if (!fits_llc_capacity(dst_util, dst_cap) &&
+		    util_greater(dst_util, src_util))
+			return mig_forbid;
+	} else {
+		/*
+		 * Don't migrate if we will leave preferred LLC
+		 * too idle, or if this migration leads to the
+		 * non-preferred LLC falling within llc_imb_pct percent
+		 * of the preferred LLC, which would trigger a later
+		 * migration back to the preferred LLC.
+		 */
+		if (fits_llc_capacity(src_util, src_cap) ||
+		    !util_greater(src_util, dst_util))
+			return mig_forbid;
+	}
+	return mig_llc;
+}
+
+/*
+ * Check if task p can migrate from source LLC to
+ * destination LLC in terms of cache aware load balance.
+ */
+static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+							struct task_struct *p)
+{
+	struct mm_struct *mm;
+	bool to_pref;
+	int cpu;
+
+	mm = p->mm;
+	if (!mm)
+		return mig_unrestricted;
+
+	cpu = mm->mm_sched_cpu;
+	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
+		return mig_unrestricted;
+
+	if (cpus_share_cache(dst_cpu, cpu))
+		to_pref = true;
+	else if (cpus_share_cache(src_cpu, cpu))
+		to_pref = false;
+	else
+		return mig_unrestricted;
+
+	return can_migrate_llc(src_cpu, dst_cpu,
+			       task_util(p), to_pref);
+}
+
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
 				 unsigned long *cap)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84118b522f22..bf72c5bab506 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2828,6 +2828,11 @@ extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 
+#ifdef CONFIG_SCHED_CACHE
+extern unsigned int llc_overload_pct;
+extern unsigned int llc_imb_pct;
+#endif
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
-- 
2.32.0



* [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (2 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 03/23] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-09 11:58   ` Peter Zijlstra
  2025-12-23  5:31   ` K Prateek Nayak
  2025-12-03 23:07 ` [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes Tim Chen
                   ` (19 subsequent siblings)
  23 siblings, 2 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

Introduce an index mapping between CPUs and their LLCs. This provides
the continuous per-LLC index needed for cache-aware load balancing in
later patches.

The existing per-CPU llc_id is usually the ID of the first CPU of the
LLC domain, which is sparse and unsuitable as an array index. Using
llc_id directly as an index would waste memory.

With the new mapping, CPUs in the same LLC share a continuous id:

  per_cpu(llc_id, CPU=0...15)  = 0
  per_cpu(llc_id, CPU=16...31) = 1
  per_cpu(llc_id, CPU=32...47) = 2
  ...
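
A minimal sketch of why a dense ID helps (the 'struct llc_stat' array is
hypothetical; max_llcs and sd_llc_id are from this patch):

	/* one slot per LLC, indexed directly by the dense id */
	struct llc_stat *stats = kcalloc(max_llcs, sizeof(*stats), GFP_KERNEL);

	if (stats)
		stats[per_cpu(sd_llc_id, cpu)].nr_tasks++;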

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2:
       Convert the static LLC id to be allocated sequentially as LLCs are
       discovered, and replace the old sd_llc_id. (Peter Zijlstra)

 kernel/sched/fair.c     |  9 ++++++-
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 60 +++++++++++++++++++++++++++++++++++++++--
 3 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 710ed9943d27..0a3918269906 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1210,10 +1210,17 @@ __read_mostly unsigned int llc_imb_pct            = 20;
 
 static int llc_id(int cpu)
 {
+	int llc;
+
 	if (cpu < 0)
 		return -1;
 
-	return per_cpu(sd_llc_id, cpu);
+	llc = per_cpu(sd_llc_id, cpu);
+	/* avoid race with cpu hotplug */
+	if (unlikely(llc >= max_llcs))
+		return -1;
+
+	return llc;
 }
 
 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf72c5bab506..728737641847 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2075,6 +2075,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 
 extern struct static_key_false sched_asym_cpucapacity;
 extern struct static_key_false sched_cluster_active;
+extern int max_llcs;
 
 static __always_inline bool sched_asym_cpucap_active(void)
 {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 444bdfdab731..f25d950ab015 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -17,6 +17,8 @@ void sched_domains_mutex_unlock(void)
 	mutex_unlock(&sched_domains_mutex);
 }
 
+int max_llcs;
+
 /* Protected by sched_domains_mutex: */
 static cpumask_var_t sched_domains_tmpmask;
 static cpumask_var_t sched_domains_tmpmask2;
@@ -668,6 +670,55 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
 
+/*
+ * Assign continuous llc id for the CPU, and return
+ * the assigned llc id.
+ */
+static int update_llc_id(struct sched_domain *sd,
+			 int cpu)
+{
+	int id = per_cpu(sd_llc_id, cpu), i;
+
+	if (id >= 0)
+		return id;
+
+	if (sd) {
+		/* Look for any assigned id and reuse it. */
+		for_each_cpu(i, sched_domain_span(sd)) {
+			id = per_cpu(sd_llc_id, i);
+
+			if (id >= 0) {
+				per_cpu(sd_llc_id, cpu) = id;
+				return id;
+			}
+		}
+	}
+
+	/*
+	 * We reach here when either 1. no id has been assigned to
+	 * this LLC domain yet, or 2. sd is NULL.
+	 * Consider the following scenario:
+	 * CPU0~CPU95 are in node0, CPU96~CPU191 are in node1,
+	 * and maxcpus=96 is appended during bootup.
+	 * case 1: When cpu_attach_domain(CPU24) runs during
+	 * bootup, CPU24 is the first CPU in its non-NULL LLC
+	 * domain, but its corresponding llc id has not been
+	 * assigned yet.
+	 *
+	 * case 2: After bootup, CPU100 is brought up manually
+	 * via sysfs. As a result, CPU100 has only a NUMA domain
+	 * attached: because CPU100 is the only CPU of its sched
+	 * domain, all its lower domains are degenerated. The LLC
+	 * domain pointer sd is NULL for CPU100.
+	 *
+	 * For both cases, we want to increase the number of LLCs.
+	 */
+	per_cpu(sd_llc_id, cpu) = max_llcs++;
+
+	return per_cpu(sd_llc_id, cpu);
+}
+
 static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain_shared *sds = NULL;
@@ -677,14 +728,13 @@ static void update_top_cache_domain(int cpu)
 
 	sd = highest_flag_domain(cpu, SD_SHARE_LLC);
 	if (sd) {
-		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
 		sds = sd->shared;
 	}
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
-	per_cpu(sd_llc_id, cpu) = id;
+	id = update_llc_id(sd, cpu);
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
@@ -2488,6 +2538,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	bool has_asym = false;
 	bool has_cluster = false;
 
+	/* first scan of LLCs */
+	if (!max_llcs) {
+		for_each_possible_cpu(i)
+			per_cpu(sd_llc_id, i) = -1;
+	}
+
 	if (WARN_ON(cpumask_empty(cpu_map)))
 		goto error;
 
-- 
2.32.0



* [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (3 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 04/23] sched/cache: Make LLC id continuous Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-09 12:11   ` Peter Zijlstra
  2025-12-12  3:34   ` Vern Hao
  2025-12-03 23:07 ` [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
                   ` (18 subsequent siblings)
  23 siblings, 2 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

With cache-aware scheduling enabled, each task is assigned a
preferred LLC ID. This allows quick identification of the LLC domain
where the task prefers to run, similar to numa_preferred_nid in
NUMA balancing.
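
For illustration only (not part of this patch), a helper along these
lines could test whether a task currently runs on its preferred LLC,
assuming the llc_id() helper introduced earlier in this series:

  static inline bool task_on_preferred_llc(struct task_struct *p)
  {
  	/* -1 means the task has no LLC preference yet */
  	if (p->preferred_llc < 0)
  		return false;

  	return p->preferred_llc == llc_id(task_cpu(p));
  }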

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: Align preferred LLC with NUMA balancing's preferred node.

 include/linux/sched.h |  1 +
 init/init_task.c      |  3 +++
 kernel/sched/fair.c   | 18 ++++++++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 278b529c91df..1ad46220cd04 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1408,6 +1408,7 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CACHE
 	struct callback_head		cache_work;
+	int				preferred_llc;
 #endif
 
 #ifdef CONFIG_RSEQ
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206f..44bae72b5b7d 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -191,6 +191,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.numa_group	= NULL,
 	.numa_faults	= NULL,
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	.preferred_llc  = -1,
+#endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 	.kasan_depth	= 1,
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a3918269906..10cec83f65d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1300,6 +1300,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	struct mm_struct *mm = p->mm;
 	struct mm_sched *pcpu_sched;
 	unsigned long epoch;
+	int mm_sched_llc = -1;
 
 	if (!sched_cache_enabled())
 		return;
@@ -1330,6 +1331,23 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 	}
+
+	if (mm->mm_sched_cpu != -1) {
+		mm_sched_llc = llc_id(mm->mm_sched_cpu);
+
+#ifdef CONFIG_NUMA_BALANCING
+		/*
+		 * Don't assign preferred LLC if it
+		 * conflicts with NUMA balancing.
+		 */
+		if (p->numa_preferred_nid >= 0 &&
+		    cpu_to_node(mm->mm_sched_cpu) != p->numa_preferred_nid)
+			mm_sched_llc = -1;
+#endif
+	}
+
+	if (p->preferred_llc != mm_sched_llc)
+		p->preferred_llc = mm_sched_llc;
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (4 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-09 12:16   ` Peter Zijlstra
  2025-12-17 10:04   ` Vern Hao
  2025-12-03 23:07 ` [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter Tim Chen
                   ` (17 subsequent siblings)
  23 siblings, 2 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

For each runqueue, track the number of tasks with an LLC preference
and how many of them are running on their preferred LLC. This mirrors
nr_numa_running and nr_preferred_running for NUMA balancing, and will
be used by cache-aware load balancing in later patches.
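
As an illustration of how these counters could be consumed (a sketch,
not something this patch adds), the share of LLC-preferring tasks that
already sit on their preferred LLC follows directly from the two
fields:

  /*
   * Illustrative sketch: percentage of LLC-preferring tasks on @rq
   * that are already running on their preferred LLC.
   */
  static unsigned int pref_llc_ratio(struct rq *rq)
  {
  	if (!rq->nr_llc_running)
  		return 0;

  	return rq->nr_pref_llc_running * 100 / rq->nr_llc_running;
  }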

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: Invoke task_of() once and reuse its result afterwards.
            (Peter Zijlstra)
            Remove hacky reset_llc_stats() and introduce sched_llc_active flag
            to properly pair enqueue/dequeue statistics update (Peter Zijlstra, K Prateek Nayak)

 include/linux/sched.h |  2 ++
 init/init_task.c      |  1 +
 kernel/sched/core.c   |  5 ++++
 kernel/sched/fair.c   | 60 ++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h  |  6 +++++
 5 files changed, 71 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1ad46220cd04..466ba8b7398c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1408,6 +1408,8 @@ struct task_struct {
 
 #ifdef CONFIG_SCHED_CACHE
 	struct callback_head		cache_work;
+	/* the task is currently accounted in a rq's preferred LLC stats */
+	bool				sched_llc_active;
 	int				preferred_llc;
 #endif
 
diff --git a/init/init_task.c b/init/init_task.c
index 44bae72b5b7d..ee78837b0aa2 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -192,6 +192,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.numa_faults	= NULL,
 #endif
 #ifdef CONFIG_SCHED_CACHE
+	.sched_llc_active = false,
 	.preferred_llc  = -1,
 #endif
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8bdf03a4b7f..48626c81ba8e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -531,6 +531,11 @@ void __trace_set_current_state(int state_value)
 }
 EXPORT_SYMBOL(__trace_set_current_state);
 
+int task_llc(const struct task_struct *p)
+{
+	return per_cpu(sd_llc_id, task_cpu(p));
+}
+
 /*
  * Serialization rules:
  *
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10cec83f65d5..d46a70a9d9fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,43 @@ static int llc_id(int cpu)
 	return llc;
 }
 
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
+{
+	int pref_llc;
+
+	if (!sched_cache_enabled())
+		return;
+
+	pref_llc = p->preferred_llc;
+	if (pref_llc < 0)
+		return;
+
+	rq->nr_llc_running++;
+	rq->nr_pref_llc_running += (pref_llc == task_llc(p));
+	p->sched_llc_active = true;
+}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
+{
+	int pref_llc;
+
+	/*
+	 * Borrow the uc_se->active trick from uclamp_rq_inc_id()/
+	 * uclamp_rq_dec_id() to avoid unbalanced updates of the
+	 * rq statistics.
+	 */
+	if (unlikely(!p->sched_llc_active))
+		return;
+
+	pref_llc = p->preferred_llc;
+	if (pref_llc < 0)
+		return;
+
+	rq->nr_llc_running--;
+	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
+	p->sched_llc_active = false;
+}
+
 void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 {
 	unsigned long epoch;
@@ -1294,6 +1331,8 @@ static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sch
 	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
 }
 
+static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
+
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
@@ -1346,8 +1385,13 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 #endif
 	}
 
-	if (p->preferred_llc != mm_sched_llc)
+	/* a task not on the rq is accounted later in account_entity_enqueue() */
+	if (task_running_on_cpu(rq->cpu, p) &&
+	    p->preferred_llc != mm_sched_llc) {
+		account_llc_dequeue(rq, p);
 		p->preferred_llc = mm_sched_llc;
+		account_llc_enqueue(rq, p);
+	}
 }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p)
@@ -1475,6 +1519,10 @@ void init_sched_mm(struct task_struct *p) { }
 
 static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
 
+static void account_llc_enqueue(struct rq *rq, struct task_struct *p) {}
+
+static void account_llc_dequeue(struct rq *rq, struct task_struct *p) {}
+
 #endif
 
 /*
@@ -3965,9 +4013,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	update_load_add(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
 		struct rq *rq = rq_of(cfs_rq);
 
-		account_numa_enqueue(rq, task_of(se));
+		account_numa_enqueue(rq, p);
+		account_llc_enqueue(rq, p);
 		list_add(&se->group_node, &rq->cfs_tasks);
 	}
 	cfs_rq->nr_queued++;
@@ -3978,7 +4028,11 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (entity_is_task(se)) {
-		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+		struct task_struct *p = task_of(se);
+		struct rq *rq = rq_of(cfs_rq);
+
+		account_numa_dequeue(rq, p);
+		account_llc_dequeue(rq, p);
 		list_del_init(&se->group_node);
 	}
 	cfs_rq->nr_queued--;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 728737641847..ee8b70647835 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1126,6 +1126,10 @@ struct rq {
 	unsigned int		nr_preferred_running;
 	unsigned int		numa_migrate_on;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int		nr_pref_llc_running;
+	unsigned int		nr_llc_running;
+#endif
 #ifdef CONFIG_NO_HZ_COMMON
 	unsigned long		last_blocked_load_update_tick;
 	unsigned int		has_blocked_load;
@@ -1980,6 +1984,8 @@ init_numa_balancing(u64 clone_flags, struct task_struct *p)
 
 #endif /* !CONFIG_NUMA_BALANCING */
 
+int task_llc(const struct task_struct *p);
+
 static inline void
 queue_balance_callback(struct rq *rq,
 		       struct balance_callback *head,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (5 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-09 13:06   ` Peter Zijlstra
                     ` (2 more replies)
  2025-12-03 23:07 ` [PATCH v2 08/23] sched/cache: Calculate the per runqueue task LLC preference Tim Chen
                   ` (16 subsequent siblings)
  23 siblings, 3 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

Each runqueue is assigned an array where each element tracks
the number of tasks preferring a given LLC, indexed from 0 to
max_llcs - 1.

For example, rq->nr_pref_llc[3] = 2 signifies that there are 2 tasks on
this runqueue which prefer to run within LLC3.

The load balancer can use this information to identify busy
runqueues and migrate tasks to their preferred LLC domains.
This array will be reallocated at runtime if the number of LLCs
increases due to CPU hotplug. Only extending the buffer (rather
than shrinking it) is supported, to simplify the implementation.

Introduce the buffer allocation mechanism here; the statistics
themselves will be calculated in the subsequent patch.
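
For illustration (not part of this patch), a defensive accessor for the
per-LLC counters might look like the sketch below; it assumes the
max_llcs counter from this series and guards against an array that has
not been allocated or resized yet:

  static unsigned int rq_nr_pref_llc(struct rq *rq, int llc)
  {
  	/* the array may not have been allocated or resized yet */
  	if (!rq->nr_pref_llc || llc < 0 || llc >= max_llcs)
  		return 0;

  	return rq->nr_pref_llc[llc];
  }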

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2:
        Remove static allocation of per runqueue LLC preference arrays.
        Allocate array size to the actual number of LLCs online. (Peter Zijlstra, Madadi Vineeth Reddy)

 kernel/sched/core.c     |   1 +
 kernel/sched/sched.h    |   1 +
 kernel/sched/topology.c | 117 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48626c81ba8e..ce533dc485f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8800,6 +8800,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SCHED_CACHE
 		raw_spin_lock_init(&rq->cpu_epoch_lock);
 		rq->cpu_epoch_next = jiffies;
+		rq->nr_pref_llc = NULL;
 #endif
 
 		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ee8b70647835..8f2a779825e4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1129,6 +1129,7 @@ struct rq {
 #ifdef CONFIG_SCHED_CACHE
 	unsigned int		nr_pref_llc_running;
 	unsigned int		nr_llc_running;
+	unsigned int		*nr_pref_llc;
 #endif
 #ifdef CONFIG_NO_HZ_COMMON
 	unsigned long		last_blocked_load_update_tick;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index f25d950ab015..d583399fc6a1 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -17,8 +17,121 @@ void sched_domains_mutex_unlock(void)
 	mutex_unlock(&sched_domains_mutex);
 }
 
+/* the number of LLCs detected so far */
+static int new_max_llcs;
+/* the number of LLCs currently in use */
 int max_llcs;
 
+#ifdef CONFIG_SCHED_CACHE
+
+static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
+{
+	unsigned int *new = NULL;
+
+	new = kcalloc(new_max_llcs, sizeof(unsigned int),
+		      GFP_KERNEL | __GFP_NOWARN);
+
+	if (!new) {
+		*gc = NULL;
+	} else {
+		/*
+		 * Place old entry in garbage collector
+		 * for later disposal.
+		 */
+		*gc = old;
+	}
+	return new;
+}
+
+static void populate_new_pref_llcs(unsigned int *old, unsigned int *new)
+{
+	int i;
+
+	if (!old)
+		return;
+
+	for (i = 0; i < max_llcs; i++)
+		new[i] = old[i];
+}
+
+static int resize_llc_pref(void)
+{
+	unsigned int *__percpu *tmp_llc_pref;
+	int i, ret = 0;
+
+	if (new_max_llcs <= max_llcs)
+		return 0;
+
+	/*
+	 * Allocate temp percpu pointer for old llc_pref,
+	 * which will be released after switching to the
+	 * new buffer.
+	 */
+	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
+	if (!tmp_llc_pref)
+		return -ENOMEM;
+
+	for_each_present_cpu(i)
+		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
+
+	/*
+	 * Resize the per rq nr_pref_llc buffer and
+	 * switch to this new buffer.
+	 */
+	for_each_present_cpu(i) {
+		struct rq_flags rf;
+		unsigned int *new;
+		struct rq *rq;
+
+		rq = cpu_rq(i);
+		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
+		if (!new) {
+			ret = -ENOMEM;
+
+			goto release_old;
+		}
+
+		/*
+		 * Locking rq ensures that rq->nr_pref_llc values
+		 * don't change with new task enqueue/dequeue
+		 * when we repopulate the newly enlarged array.
+		 */
+		rq_lock_irqsave(rq, &rf);
+		populate_new_pref_llcs(rq->nr_pref_llc, new);
+		rq->nr_pref_llc = new;
+		rq_unlock_irqrestore(rq, &rf);
+	}
+
+release_old:
+	/*
+	 * Load balancing is done under the RCU read lock.
+	 * Wait for any load balancing that started before or
+	 * during the resize to finish; it may still reference
+	 * the old nr_pref_llc[] arrays that have not been resized.
+	 */
+	synchronize_rcu();
+	for_each_present_cpu(i)
+		kfree(*per_cpu_ptr(tmp_llc_pref, i));
+
+	free_percpu(tmp_llc_pref);
+
+	/* on success, update the current number of LLCs */
+	if (!ret)
+		max_llcs = new_max_llcs;
+
+	return ret;
+}
+
+#else
+
+static int resize_llc_pref(void)
+{
+	max_llcs = new_max_llcs;
+	return 0;
+}
+
+#endif
+
 /* Protected by sched_domains_mutex: */
 static cpumask_var_t sched_domains_tmpmask;
 static cpumask_var_t sched_domains_tmpmask2;
@@ -714,7 +827,7 @@ static int update_llc_id(struct sched_domain *sd,
 	 *
 	 * For both cases, we want to increase the number of LLCs.
 	 */
-	per_cpu(sd_llc_id, cpu) = max_llcs++;
+	per_cpu(sd_llc_id, cpu) = new_max_llcs++;
 
 	return per_cpu(sd_llc_id, cpu);
 }
@@ -2674,6 +2787,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	if (has_cluster)
 		static_branch_inc_cpuslocked(&sched_cluster_active);
 
+	resize_llc_pref();
+
 	if (rq && sched_debug_verbose)
 		pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map));
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 08/23] sched/cache: Calculate the per runqueue task LLC preference
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (6 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-03 23:07 ` [PATCH v2 09/23] sched/cache: Count tasks preferring destination LLC in a sched group Tim Chen
                   ` (15 subsequent siblings)
  23 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

For each runqueue, count the number of tasks preferring each LLC.
These statistics are computed during task enqueue and dequeue
operations and are used by cache-aware load balancing.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: Split from previous patch for easier review.

 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d46a70a9d9fb..b0e87616e377 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1231,11 +1231,12 @@ static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 		return;
 
 	pref_llc = p->preferred_llc;
-	if (pref_llc < 0)
+	if (pref_llc < 0 || pref_llc >= max_llcs)
 		return;
 
 	rq->nr_llc_running++;
 	rq->nr_pref_llc_running += (pref_llc == task_llc(p));
+	rq->nr_pref_llc[pref_llc]++;
 	p->sched_llc_active = true;
 }
 
@@ -1252,11 +1253,12 @@ static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 		return;
 
 	pref_llc = p->preferred_llc;
-	if (pref_llc < 0)
+	if (pref_llc < 0 || pref_llc >= max_llcs)
 		return;
 
 	rq->nr_llc_running--;
 	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
+	rq->nr_pref_llc[pref_llc]--;
 	p->sched_llc_active = false;
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 09/23] sched/cache: Count tasks preferring destination LLC in a sched group
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (7 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 08/23] sched/cache: Calculate the per runqueue task LLC preference Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-10 12:52   ` Peter Zijlstra
  2025-12-03 23:07 ` [PATCH v2 10/23] sched/cache: Check local_group only once in update_sg_lb_stats() Tim Chen
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

During LLC load balancing, tabulate the number of tasks on each runqueue
in a sched group that prefer the LLC containing env->dst_cpu.

For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
selected as the busiest source to pick tasks from.

Within a source LLC, the total number of tasks preferring a destination
LLC is computed by summing counts across all CPUs in that LLC. For
instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
LLC3, the total for LLC0 is 3.

These statistics allow the load balancer to choose tasks from source
sched groups that best match their preferred LLCs.
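
A rough sketch of this accumulation (illustrative only; the actual
update_sg_lb_stats() hunk below also honours env->cpus and skips CPUs
that are already in the destination LLC):

  static unsigned int group_pref_dst_llc(struct sched_group *group, int dst_cpu)
  {
  	int i, dst_llc = llc_id(dst_cpu);
  	unsigned int nr = 0;

  	if (dst_llc < 0)
  		return 0;

  	for_each_cpu(i, sched_group_span(group)) {
  		/* per-rq counts filled in by the earlier patches */
  		if (cpu_rq(i)->nr_pref_llc)
  			nr += cpu_rq(i)->nr_pref_llc[dst_llc];
  	}

  	return nr;
  }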

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2:
        Convert nr_pref_llc array in sg_lb_stats to a single
        variable as only the dst LLC stat is needed.
        (K Prateek Nayak)

 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0e87616e377..4d7803f69a74 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10445,6 +10445,9 @@ struct sg_lb_stats {
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int nr_pref_llc;
+#endif
 };
 
 /*
@@ -10912,6 +10915,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 {
 	int i, nr_running, local_group, sd_flags = env->sd->flags;
 	bool balancing_at_rd = !env->sd->parent;
+#ifdef CONFIG_SCHED_CACHE
+	int dst_llc = llc_id(env->dst_cpu);
+#endif
 
 	memset(sgs, 0, sizeof(*sgs));
 
@@ -10932,6 +10938,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (cpu_overutilized(i))
 			*sg_overutilized = 1;
 
+#ifdef CONFIG_SCHED_CACHE
+		if (sched_cache_enabled() && llc_id(i) != dst_llc &&
+		    dst_llc >= 0)
+			sgs->nr_pref_llc += rq->nr_pref_llc[dst_llc];
+#endif
+
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 10/23] sched/cache: Check local_group only once in update_sg_lb_stats()
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (8 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 09/23] sched/cache: Count tasks preferring destination LLC in a sched group Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-03 23:07 ` [PATCH v2 11/23] sched/cache: Prioritize tasks preferring destination LLC during balancing Tim Chen
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

There is no need to check the local group twice for both group_asym_packing
and group_smt_balance. Adjust the code to facilitate future checks for group
types (cache-aware load balancing) as well.

No functional changes are expected.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2:
       New code cleanup patch. (Peter Zijlstra)

 kernel/sched/fair.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4d7803f69a74..6e4c1ae1bdda 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10984,14 +10984,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_weight = group->group_weight;
 
-	/* Check if dst CPU is idle and preferred to this group */
-	if (!local_group && env->idle && sgs->sum_h_nr_running &&
-	    sched_group_asym(env, sgs, group))
-		sgs->group_asym_packing = 1;
-
-	/* Check for loaded SMT group to be balanced to dst CPU */
-	if (!local_group && smt_balance(env, sgs, group))
-		sgs->group_smt_balance = 1;
+	if (!local_group) {
+		/* Check if dst CPU is idle and preferred to this group */
+		if (env->idle && sgs->sum_h_nr_running &&
+		    sched_group_asym(env, sgs, group))
+			sgs->group_asym_packing = 1;
+
+		/* Check for loaded SMT group to be balanced to dst CPU */
+		if (smt_balance(env, sgs, group))
+			sgs->group_smt_balance = 1;
+	}
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 11/23] sched/cache: Prioritize tasks preferring destination LLC during balancing
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (9 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 10/23] sched/cache: Check local_group only once in update_sg_lb_stats() Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-03 23:07 ` [PATCH v2 12/23] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
                   ` (12 subsequent siblings)
  23 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

During LLC load balancing, first check for tasks that prefer the
destination LLC and balance them to it before others.

Mark source sched groups containing tasks preferring non-local LLCs
with the group_llc_balance flag. This ensures the load balancer later
pulls or pushes these tasks toward their preferred LLCs.

The load balancer selects the busiest sched_group and migrates tasks
to less busy groups to distribute load across CPUs.

With cache-aware scheduling enabled, the busiest sched_group is
the one with the most tasks preferring the destination LLC. If
the group has the group_llc_balance flag set, cache-aware load
balancing is triggered.

Introduce the helper function update_llc_busiest() to identify the
sched_group with the most tasks preferring the destination LLC.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2:
       Fix comparison in can_migrate_llc(), which uses an uninitialized
       env->src_cpu. Use the candidate group's first CPU instead. (Aaron Lu)
    
       Fix a race condition during bootup with build_sched_domains(),
       where the per-cpu(sd_llc_id) is reset to -1. (lkp/0day)
       Put the set of group_llc_balance and the usage of it into
       1 patch. (Peter Zijlstra)
    
       Change group_llc_balance priority to be lower than group_overloaded
       and embed it into normal load balance path. (Peter Zijlstra)
    
       Remove the sched group's SD_SHARE_LLC check in llc_balance(), because
       we should allow task migration across NUMA nodes to a task's preferred
       LLC, where the domain does not have the SD_SHARE_LLC flag.

 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e4c1ae1bdda..db555c11b5b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9531,6 +9531,11 @@ enum group_type {
 	 * from balancing the load across the system.
 	 */
 	group_imbalanced,
+	/*
+	 * There are tasks running on a non-preferred LLC; they could be
+	 * moved to their preferred LLC without creating too much imbalance.
+	 */
+	group_llc_balance,
 	/*
 	 * The CPU is overloaded and can't provide expected CPU cycles to all
 	 * tasks.
@@ -10440,6 +10445,7 @@ struct sg_lb_stats {
 	enum group_type group_type;
 	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
+	unsigned int group_llc_balance;		/* Tasks should be moved to preferred LLC */
 	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -10698,6 +10704,9 @@ group_type group_classify(unsigned int imbalance_pct,
 	if (group_is_overloaded(imbalance_pct, sgs))
 		return group_overloaded;
 
+	if (sgs->group_llc_balance)
+		return group_llc_balance;
+
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
@@ -10890,11 +10899,55 @@ static void record_sg_llc_stats(struct lb_env *env,
 	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
 		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
 }
+
+/*
+ * Do LLC balancing on a sched group that contains an LLC and has tasks
+ * preferring to run on the LLC of the idle dst_cpu.
+ */
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	if (!sched_cache_enabled())
+		return false;
+
+	if (env->sd->flags & SD_SHARE_LLC)
+		return false;
+
+	if (sgs->nr_pref_llc &&
+	    can_migrate_llc(cpumask_first(sched_group_span(group)),
+			    env->dst_cpu, 0, true) == mig_llc)
+		return true;
+
+	return false;
+}
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	/*
+	 * The candidate group has more tasks wanting to run on dst_cpu's LLC.
+	 */
+	return sgs->nr_pref_llc > busiest->nr_pref_llc;
+}
 #else
 static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
 				       struct sched_group *group)
 {
 }
+
+static inline bool llc_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	return false;
+}
+
+static bool update_llc_busiest(struct lb_env *env,
+			       struct sg_lb_stats *busiest,
+			       struct sg_lb_stats *sgs)
+{
+	return false;
+}
 #endif
 
 /**
@@ -10993,6 +11046,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		/* Check for loaded SMT group to be balanced to dst CPU */
 		if (smt_balance(env, sgs, group))
 			sgs->group_smt_balance = 1;
+
+		/* Check if tasks in this group can be moved to their preferred LLC */
+		if (llc_balance(env, sgs, group))
+			sgs->group_llc_balance = 1;
 	}
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
@@ -11056,6 +11113,10 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		/* Select the overloaded group with highest avg_load. */
 		return sgs->avg_load > busiest->avg_load;
 
+	case group_llc_balance:
+		/* Select the group with the most tasks preferring the dst LLC */
+		return update_llc_busiest(env, busiest, sgs);
+
 	case group_imbalanced:
 		/*
 		 * Select the 1st imbalanced group as we don't have any way to
@@ -11318,6 +11379,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
 			return false;
 		break;
 
+	case group_llc_balance:
 	case group_imbalanced:
 	case group_asym_packing:
 	case group_smt_balance:
@@ -11450,6 +11512,7 @@ sched_balance_find_dst_group(struct sched_domain *sd, struct task_struct *p, int
 			return NULL;
 		break;
 
+	case group_llc_balance:
 	case group_imbalanced:
 	case group_asym_packing:
 	case group_smt_balance:
@@ -11949,7 +12012,8 @@ static struct sched_group *sched_balance_find_src_group(struct lb_env *env)
 	 * group's child domain.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    sibling_imbalance(env, &sds, busiest, local) > 1)
+	    (busiest->group_type == group_llc_balance ||
+	    sibling_imbalance(env, &sds, busiest, local) > 1))
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 12/23] sched/cache: Add migrate_llc_task migration type for cache-aware balancing
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (10 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 11/23] sched/cache: Prioritize tasks preferring destination LLC during balancing Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-10 13:32   ` Peter Zijlstra
  2025-12-03 23:07 ` [PATCH v2 13/23] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

Introduce a new migration type, migrate_llc_task, to support
cache-aware load balancing.

After identifying the busiest sched_group (having the most tasks
preferring the destination LLC), mark migrations with this type.
During load balancing, each runqueue in the busiest sched_group is
examined, and the runqueue with the highest number of tasks preferring
the destination CPU's LLC is selected as the busiest runqueue.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: Remove unnecessary cpus_share_cache() check in
            sched_balance_find_src_rq() (K Prateek Nayak)

 kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index db555c11b5b8..529adf342ce0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9547,7 +9547,8 @@ enum migration_type {
 	migrate_load = 0,
 	migrate_util,
 	migrate_task,
-	migrate_misfit
+	migrate_misfit,
+	migrate_llc_task
 };
 
 #define LBF_ALL_PINNED	0x01
@@ -10134,6 +10135,10 @@ static int detach_tasks(struct lb_env *env)
 			env->imbalance -= util;
 			break;
 
+		case migrate_llc_task:
+			env->imbalance--;
+			break;
+
 		case migrate_task:
 			env->imbalance--;
 			break;
@@ -11766,6 +11771,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}
 
+#ifdef CONFIG_SCHED_CACHE
+	if (busiest->group_type == group_llc_balance) {
+		/* Move a task that prefers the local LLC */
+		env->migration_type = migrate_llc_task;
+		env->imbalance = 1;
+		return;
+	}
+#endif
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -12073,6 +12087,10 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 	struct rq *busiest = NULL, *rq;
 	unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1;
 	unsigned int busiest_nr = 0;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned int busiest_pref_llc = 0;
+	int dst_llc;
+#endif
 	int i;
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
@@ -12181,6 +12199,16 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
 			}
 			break;
 
+		case migrate_llc_task:
+#ifdef CONFIG_SCHED_CACHE
+			dst_llc = llc_id(env->dst_cpu);
+			if (dst_llc >= 0 &&
+			    busiest_pref_llc < rq->nr_pref_llc[dst_llc]) {
+				busiest_pref_llc = rq->nr_pref_llc[dst_llc];
+				busiest = rq;
+			}
+#endif
+			break;
 		case migrate_task:
 			if (busiest_nr < nr_running) {
 				busiest_nr = nr_running;
@@ -12363,6 +12391,8 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
 	case migrate_misfit:
 		__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
 		break;
+	case migrate_llc_task:
+		break;
 	}
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 13/23] sched/cache: Handle moving single tasks to/from their preferred LLC
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (11 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 12/23] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-03 23:07 ` [PATCH v2 14/23] sched/cache: Consider LLC preference when selecting tasks for load balancing Tim Chen
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

If the busiest runqueue has only one task, active balancing may be
invoked to move it. However, before migration, check whether the task
is running on its preferred LLC.

Do not move a lone task to another LLC if it would move the task
away from its preferred LLC or cause excessive imbalance between LLCs.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: Remove unneeded preferred LLC migration check from
            active_load_balance_cpu_stop().

 kernel/sched/fair.c | 51 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 529adf342ce0..aed3fab98d7c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9878,12 +9878,57 @@ static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu
 			       task_util(p), to_pref);
 }
 
+/*
+ * Check if active load balance breaks LLC locality in
+ * terms of cache aware load balance.
+ */
+static inline bool
+break_llc_locality(struct lb_env *env)
+{
+	if (!sched_cache_enabled())
+		return false;
+
+	if (cpus_share_cache(env->src_cpu, env->dst_cpu))
+		return false;
+	/*
+	 * All tasks prefer to stay on their current CPU.
+	 * All tasks on the source runqueue prefer their current LLC.
+	 * Do not pull a task away from its preferred LLC if:
+	 * 2. Migrating it away from its preferred LLC would violate
+	 *    the cache-aware scheduling policy.
+	 */
+	if (env->src_rq->nr_pref_llc_running == env->src_rq->cfs.h_nr_runnable) {
+		unsigned long util = 0;
+		struct task_struct *cur;
+
+		if (env->src_rq->nr_running <= 1)
+			return true;
+
+		rcu_read_lock();
+		cur = rcu_dereference(env->src_rq->curr);
+		if (cur)
+			util = task_util(cur);
+		rcu_read_unlock();
+
+		if (can_migrate_llc(env->src_cpu, env->dst_cpu,
+				    util, false) == mig_forbid)
+			return true;
+	}
+
+	return false;
+}
 #else
 static inline bool get_llc_stats(int cpu, unsigned long *util,
 				 unsigned long *cap)
 {
 	return false;
 }
+
+static inline bool
+break_llc_locality(struct lb_env *env)
+{
+	return false;
+}
 #endif
 /*
  * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
@@ -12279,6 +12324,9 @@ static int need_active_balance(struct lb_env *env)
 {
 	struct sched_domain *sd = env->sd;
 
+	if (break_llc_locality(env))
+		return 0;
+
 	if (asym_active_balance(env))
 		return 1;
 
@@ -12298,7 +12346,8 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
-	if (env->migration_type == migrate_misfit)
+	if (env->migration_type == migrate_misfit ||
+	    env->migration_type == migrate_llc_task)
 		return 1;
 
 	return 0;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 14/23] sched/cache: Consider LLC preference when selecting tasks for load balancing
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (12 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 13/23] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-10 15:58   ` Peter Zijlstra
  2025-12-03 23:07 ` [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach Tim Chen
                   ` (9 subsequent siblings)
  23 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

Currently, task selection from the busiest runqueue ignores LLC
preferences. Reorder tasks in the busiest queue to prioritize selection
as follows:

  1. Tasks preferring the destination CPU's LLC
  2. Tasks with no LLC preference
  3. Tasks preferring an LLC different from their current one
  4. Tasks preferring the LLC they are currently on

This improves the likelihood that tasks are migrated to their
preferred LLC.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: No change.

 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aed3fab98d7c..dd09a816670e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10092,6 +10092,68 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 	return NULL;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Prepare lists to detach tasks in the following order:
+ * 1. tasks that prefer dst cpu's LLC
+ * 2. tasks that have no preference in LLC
+ * 3. tasks that prefer LLC other than the ones they are on
+ * 4. tasks that prefer the LLC that they are currently on.
+ */
+static struct list_head
+*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks)
+{
+	struct task_struct *p;
+	LIST_HEAD(pref_old_llc);
+	LIST_HEAD(pref_new_llc);
+	LIST_HEAD(no_pref_llc);
+	LIST_HEAD(pref_other_llc);
+
+	if (!sched_cache_enabled())
+		return tasks;
+
+	if (cpus_share_cache(env->dst_cpu, env->src_cpu))
+		return tasks;
+
+	while (!list_empty(tasks)) {
+		p = list_last_entry(tasks, struct task_struct, se.group_node);
+
+		if (p->preferred_llc == llc_id(env->dst_cpu)) {
+			list_move(&p->se.group_node, &pref_new_llc);
+			continue;
+		}
+
+		if (p->preferred_llc == llc_id(env->src_cpu)) {
+			list_move(&p->se.group_node, &pref_old_llc);
+			continue;
+		}
+
+		if (p->preferred_llc == -1) {
+			list_move(&p->se.group_node, &no_pref_llc);
+			continue;
+		}
+
+		list_move(&p->se.group_node, &pref_other_llc);
+	}
+
+	/*
+	 * We detach tasks from list tail in detach tasks.  Put tasks
+	 * detach_tasks() detaches tasks from the tail of the list, so
+	 * put the tasks to be chosen first at the end of the list.
+	list_splice(&pref_new_llc, tasks);
+	list_splice(&no_pref_llc, tasks);
+	list_splice(&pref_other_llc, tasks);
+	list_splice(&pref_old_llc, tasks);
+	return tasks;
+}
+#else
+static inline struct list_head
+*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks)
+{
+	return tasks;
+}
+#endif
+
 /*
  * detach_tasks() -- tries to detach up to imbalance load/util/tasks from
  * busiest_rq, as part of a balancing operation within domain "sd".
@@ -10100,7 +10162,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
  */
 static int detach_tasks(struct lb_env *env)
 {
-	struct list_head *tasks = &env->src_rq->cfs_tasks;
+	struct list_head *tasks;
 	unsigned long util, load;
 	struct task_struct *p;
 	int detached = 0;
@@ -10119,6 +10181,8 @@ static int detach_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
+	tasks = order_tasks_by_llc(env, &env->src_rq->cfs_tasks);
+
 	while (!list_empty(tasks)) {
 		/*
 		 * We don't want to steal all, otherwise we may be treated likewise,
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (13 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 14/23] sched/cache: Consider LLC preference when selecting tasks for load balancing Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-10 16:30   ` Peter Zijlstra
  2025-12-03 23:07 ` [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node Tim Chen
                   ` (8 subsequent siblings)
  23 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Tim Chen, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

During the final step of load balancing, can_migrate_task() now
considers a task's LLC preference before moving it out of its
preferred LLC.

Additionally, add checks in detach_tasks() to prevent selecting tasks
that prefer their current LLC.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: Leave out tasks under core scheduling from the cache aware
            load balance. (K Prateek Nayak)
    
            Reduce the degree of honoring preferred_llc in detach_tasks().
            If certain conditions are met, stop migrating tasks that prefer
            their current LLC and instead continue load balancing from other
            busiest runqueues. (K Prateek Nayak)

 kernel/sched/fair.c  | 63 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h | 13 +++++++++
 2 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd09a816670e..580a967efdac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9852,8 +9852,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
  * Check if task p can migrate from source LLC to
  * destination LLC in terms of cache aware load balance.
  */
-static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
-							struct task_struct *p)
+static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+					 struct task_struct *p)
 {
 	struct mm_struct *mm;
 	bool to_pref;
@@ -10025,6 +10025,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (env->flags & LBF_ACTIVE_LB)
 		return 1;
 
+#ifdef CONFIG_SCHED_CACHE
+	if (sched_cache_enabled() &&
+	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid &&
+	    !task_has_sched_core(p))
+		return 0;
+#endif
+
 	degrades = migrate_degrades_locality(p, env);
 	if (!degrades)
 		hot = task_hot(p, env);
@@ -10146,12 +10153,55 @@ static struct list_head
 	list_splice(&pref_old_llc, tasks);
 	return tasks;
 }
+
+static bool stop_migrate_src_rq(struct task_struct *p,
+				struct lb_env *env,
+				int detached)
+{
+	if (!sched_cache_enabled() || p->preferred_llc == -1 ||
+	    cpus_share_cache(env->src_cpu, env->dst_cpu) ||
+	    env->sd->nr_balance_failed)
+		return false;
+
+	/*
+	 * Stop migration for the src_rq and pull from a
+	 * different busy runqueue in the following cases:
+	 *
+	 * 1. Trying to migrate a task to its preferred
+	 *    LLC, but the chosen task does not prefer the dest
+	 *    LLC - case 3 in order_tasks_by_llc(). This violates
+	 *    the goal of migrate_llc_task. However, we should
+	 *    stop detaching only if some tasks have been detached
+	 *    and the imbalance has been mitigated.
+	 *
+	 * 2. Don't detach more tasks if the remaining tasks want
+	 *    to stay. We know the remaining tasks all prefer the
+	 *    current LLC, because after order_tasks_by_llc(), the
+	 *    tasks that prefer the current LLC are the least favored
+	 *    candidates to be migrated out.
+	 */
+	if (env->migration_type == migrate_llc_task &&
+	    detached && llc_id(env->dst_cpu) != p->preferred_llc)
+		return true;
+
+	if (llc_id(env->src_cpu) == p->preferred_llc)
+		return true;
+
+	return false;
+}
 #else
 static inline struct list_head
 *order_tasks_by_llc(struct lb_env *env, struct list_head *tasks)
 {
 	return tasks;
 }
+
+static bool stop_migrate_src_rq(struct task_struct *p,
+				struct lb_env *env,
+				int detached)
+{
+	return false;
+}
 #endif
 
 /*
@@ -10205,6 +10255,15 @@ static int detach_tasks(struct lb_env *env)
 
 		p = list_last_entry(tasks, struct task_struct, se.group_node);
 
+		/*
+		 * Check if detaching current src_rq should be stopped, because
+		 * doing so would break cache aware load balance. If we stop
+		 * here, the env->flags has LBF_ALL_PINNED, which would cause
+		 * the load balance to pull from another busy runqueue.
+		 */
+		if (stop_migrate_src_rq(p, env, detached))
+			break;
+
 		if (!can_migrate_task(p, env))
 			goto next;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8f2a779825e4..40798a06e058 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1485,6 +1485,14 @@ extern void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags);
 extern void sched_core_get(void);
 extern void sched_core_put(void);
 
+static inline bool task_has_sched_core(struct task_struct *p)
+{
+	if (sched_core_disabled())
+		return false;
+
+	return !!p->core_cookie;
+}
+
 #else /* !CONFIG_SCHED_CORE: */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1524,6 +1532,11 @@ static inline bool sched_group_cookie_match(struct rq *rq,
 	return true;
 }
 
+static inline bool task_has_sched_core(struct task_struct *p)
+{
+	return false;
+}
+
 #endif /* !CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_RT_GROUP_SCHED
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (14 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-10 16:32   ` Peter Zijlstra
  2025-12-03 23:07 ` [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling Tim Chen
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel,
	Libo Chen

From: Chen Yu <yu.c.chen@intel.com>

Cache-aware load balancing should only be enabled if there is more
than one LLC within a NUMA node. sched_cache_present is introduced to
indicate whether the platform has such a topology.
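
A worked example with hypothetical numbers (not taken from this patch):
a NUMA domain spanning 32 CPUs whose child LLC domain spans 16 CPUs
gives

	nr_llcs = sd->span_weight / child->span_weight = 32 / 16 = 2

so has_multi_llcs becomes true and sched_cache_present is set. With a
single LLC covering all 32 CPUs the ratio would be 1 and cache-aware
load balancing stays disabled.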

Suggested-by: Libo Chen <libo.chen@oracle.com>
Suggested-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2:
    	Use the flag sched_cache_present to indicate whether a platform
    	supports cache-aware scheduling. Change this flag from a static
    	key to a plain bool, since there should be only one static key
    	controlling cache-aware scheduling. (Peter Zijlstra)

 kernel/sched/topology.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d583399fc6a1..9799e3a9a609 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -24,6 +24,8 @@ int max_llcs;
 
 #ifdef CONFIG_SCHED_CACHE
 
+static bool sched_cache_present;
+
 static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
 {
 	unsigned int *new = NULL;
@@ -54,7 +56,7 @@ static void populate_new_pref_llcs(unsigned int *old, unsigned int *new)
 		new[i] = old[i];
 }
 
-static int resize_llc_pref(void)
+static int resize_llc_pref(bool has_multi_llcs)
 {
 	unsigned int *__percpu *tmp_llc_pref;
 	int i, ret = 0;
@@ -102,6 +104,11 @@ static int resize_llc_pref(void)
 		rq_unlock_irqrestore(rq, &rf);
 	}
 
+	if (has_multi_llcs) {
+		sched_cache_present = true;
+		pr_info_once("Cache aware load balance is enabled on the platform.\n");
+	}
+
 release_old:
 	/*
 	 * Load balance is done under rcu_lock.
@@ -124,7 +131,7 @@ static int resize_llc_pref(void)
 
 #else
 
-static int resize_llc_pref(void)
+static int resize_llc_pref(bool has_multi_llcs)
 {
 	max_llcs = new_max_llcs;
 	return 0;
@@ -2644,6 +2651,7 @@ static int
 build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
 {
 	enum s_alloc alloc_state = sa_none;
+	bool has_multi_llcs = false;
 	struct sched_domain *sd;
 	struct s_data d;
 	struct rq *rq = NULL;
@@ -2736,10 +2744,12 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 				 * between LLCs and memory channels.
 				 */
 				nr_llcs = sd->span_weight / child->span_weight;
-				if (nr_llcs == 1)
+				if (nr_llcs == 1) {
 					imb = sd->span_weight >> 3;
-				else
+				} else {
 					imb = nr_llcs;
+					has_multi_llcs = true;
+				}
 				imb = max(1U, imb);
 				sd->imb_numa_nr = imb;
 
@@ -2787,7 +2797,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	if (has_cluster)
 		static_branch_inc_cpuslocked(&sched_cluster_active);
 
-	resize_llc_pref();
+	resize_llc_pref(has_multi_llcs);
 
 	if (rq && sched_debug_verbose)
 		pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map));
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (15 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-10 16:51   ` Peter Zijlstra
  2025-12-17  9:40   ` Aaron Lu
  2025-12-03 23:07 ` [PATCH v2 18/23] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
                   ` (6 subsequent siblings)
  23 siblings, 2 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread count.

This number will be used by a subsequent patch to inhibit cache-aware
load balancing.
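
For reference, the smoothing applied via update_avg() behaves like an
exponentially weighted moving average; a standalone sketch of that
weighting, assuming the 1/8 factor commonly used by the scheduler's
helper, is:

  static void ewma_update(u64 *avg, u64 sample)
  {
  	s64 diff = sample - *avg;	/* distance from the current average */

  	*avg += diff / 8;		/* move 1/8 of the way toward the sample */
  }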

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: No change.

 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1ea16ef90566..04743983de4d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1043,6 +1043,7 @@ struct mm_struct {
 		raw_spinlock_t mm_sched_lock;
 		unsigned long mm_sched_epoch;
 		int mm_sched_cpu;
+		u64 nr_running_avg ____cacheline_aligned_in_smp;
 #endif
 
 #ifdef CONFIG_MMU
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 580a967efdac..2f38ad82688f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
 
 static void __no_profile task_cache_work(struct callback_head *work)
 {
-	struct task_struct *p = current;
+	struct task_struct *p = current, *cur;
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
 	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1;
+	int cpu, m_a_cpu = -1, nr_running = 0;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct callback_head *work)
 					m_occ = occ;
 					m_cpu = i;
 				}
+				rcu_read_lock();
+				cur = rcu_dereference(cpu_rq(i)->curr);
+				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+				    cur->mm == mm)
+					nr_running++;
+				rcu_read_unlock();
 			}
 
 			/*
@@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
 		mm->mm_sched_cpu = m_a_cpu;
 	}
 
+	update_avg(&mm->nr_running_avg, nr_running);
 	free_cpumask_var(cpus);
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 18/23] sched/cache: Disable cache aware scheduling for processes with high thread counts
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (16 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-03 23:07 ` [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

If the number of active threads within the process exceeds the number
of cores in the LLC (the LLC's CPU count divided by the number of SMT
siblings), do not enable cache-aware scheduling: aggregating that many
threads onto the preferred LLC risks cache contention there. For
example, an LLC with 16 CPUs and two SMT siblings per core aggregates
at most 8 active threads.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: No change.

 kernel/sched/fair.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f38ad82688f..6afa3f9a4e9b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,18 @@ static int llc_id(int cpu)
 	return llc;
 }
 
+static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+{
+	int smt_nr = 1;
+
+#ifdef CONFIG_SCHED_SMT
+	if (sched_smt_active())
+		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
+#endif
+
+	return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
 	int pref_llc;
@@ -1365,10 +1377,12 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 
 	/*
 	 * If this task hasn't hit task_cache_work() for a while, or it
-	 * has only 1 thread, invalidate its preferred state.
+	 * has only 1 thread, or has too many active threads, invalidate
+	 * its preferred state.
 	 */
 	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
-	    get_nr_threads(p) <= 1) {
+	    get_nr_threads(p) <= 1 ||
+	    exceed_llc_nr(mm, cpu_of(rq))) {
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 	}
@@ -1435,6 +1449,13 @@ static void __no_profile task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	if (get_nr_threads(p) <= 1) {
+		if (mm->mm_sched_cpu != -1)
+			mm->mm_sched_cpu = -1;
+
+		return;
+	}
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
 
@@ -9874,6 +9895,10 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
+	/* skip cache aware load balance for single/too many threads */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
+		return mig_unrestricted;
+
 	if (cpus_share_cache(dst_cpu, cpu))
 		to_pref = true;
 	else if (cpus_share_cache(src_cpu, cpu))
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (17 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 18/23] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-18  3:59   ` Vern Hao
  2025-12-03 23:07 ` [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling Tim Chen
                   ` (4 subsequent siblings)
  23 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint by comparing
its RSS (anonymous and shared pages) to the size of the LLC. If RSS
exceeds the LLC size, skip cache-aware scheduling.
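
As a rough worked example (assuming 4KB pages): with a 32MB LLC, a
process is skipped once its anonymous plus shmem RSS exceeds 32MB,
i.e. 8192 resident pages.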

Note that RSS is only an approximation of the memory footprint.
By default, the comparison is strict, but a later patch will allow
users to provide a hint to adjust this threshold.

According to testing by Adam, some systems have no shared L3 but do
have shared L2 clusters; in that case, the L2 becomes the LLC[1].

Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com/

Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: Assigned curr_cpu in task_cache_work() before checking
            exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound
            access.(lkp/0day)

 include/linux/cacheinfo.h | 21 ++++++++++-------
 kernel/sched/fair.c       | 49 +++++++++++++++++++++++++++++++++++----
 2 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index c8f4f0a0b874..82d0d59ca0e1 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
 
 const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
 
-/*
- * Get the cacheinfo structure for the cache associated with @cpu at
- * level @level.
- * cpuhp lock must be held.
- */
-static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
 {
 	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
 	int i;
 
-	lockdep_assert_cpus_held();
-
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == level) {
 			if (ci->info_list[i].attributes & CACHE_ID)
@@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
 	return NULL;
 }
 
+/*
+ * Get the cacheinfo structure for the cache associated with @cpu at
+ * level @level.
+ * cpuhp lock must be held.
+ */
+static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+{
+	lockdep_assert_cpus_held();
+
+	return _get_cpu_cacheinfo_level(cpu, level);
+}
+
 /*
  * Get the id of the cache associated with @cpu at level @level.
  * cpuhp lock must be held.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6afa3f9a4e9b..424ec601cfdf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
 	return llc;
 }
 
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+	struct cacheinfo *ci;
+	unsigned long rss;
+	unsigned int llc;
+
+	/*
+	 * get_cpu_cacheinfo_level() can not be used
+	 * because it requires the cpu_hotplug_lock
+	 * to be held. Use _get_cpu_cacheinfo_level()
+	 * directly because the 'cpu' can not be
+	 * offlined at the moment.
+	 */
+	ci = _get_cpu_cacheinfo_level(cpu, 3);
+	if (!ci) {
+		/*
+		 * On system without L3 but with shared L2,
+		 * L2 becomes the LLC.
+		 */
+		ci = _get_cpu_cacheinfo_level(cpu, 2);
+		if (!ci)
+			return true;
+	}
+
+	llc = ci->size;
+
+	rss = get_mm_counter(mm, MM_ANONPAGES) +
+		get_mm_counter(mm, MM_SHMEMPAGES);
+
+	return (llc <= (rss * PAGE_SIZE));
+}
+
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
 	int smt_nr = 1;
@@ -1382,7 +1414,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 */
 	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
 	    get_nr_threads(p) <= 1 ||
-	    exceed_llc_nr(mm, cpu_of(rq))) {
+	    exceed_llc_nr(mm, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 	}
@@ -1439,7 +1472,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
 	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1, nr_running = 0;
+	int cpu, m_a_cpu = -1, nr_running = 0, curr_cpu;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1449,7 +1482,9 @@ static void __no_profile task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
-	if (get_nr_threads(p) <= 1) {
+	curr_cpu = task_cpu(p);
+	if (get_nr_threads(p) <= 1 ||
+	    exceed_llc_capacity(mm, curr_cpu)) {
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 
@@ -9895,8 +9930,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
-	/* skip cache aware load balance for single/too many threads */
-	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
+	/*
+	 * Skip cache aware load balance for single/too many threads
+	 * or large footprint.
+	 */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu))
 		return mig_unrestricted;
 
 	if (cpus_share_cache(dst_cpu, cpu))
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (18 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-10 17:02   ` Peter Zijlstra
                     ` (2 more replies)
  2025-12-03 23:07 ` [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing Tim Chen
                   ` (3 subsequent siblings)
  23 siblings, 3 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

Introduce a set of debugfs knobs to control the enabling of
and parameters for cache-aware load balancing.

(1) llc_enabled
llc_enabled acts as the primary switch - users can toggle it to
enable or disable cache aware load balancing.

(2) llc_aggr_tolerance
With sched_cache enabled, the scheduler uses a process's RSS as a
proxy for its LLC footprint to determine if aggregating tasks on the
preferred LLC could cause cache contention. If RSS exceeds the LLC
size, aggregation is skipped. Some workloads with large RSS but small
actual memory footprints may still benefit from aggregation. Since
the kernel cannot efficiently track per-task cache usage (resctrl is
user-space only), userspace can provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
users control how strictly RSS limits aggregation. Values range from
0 to 100:

  - 0: Cache-aware scheduling is disabled.
  - 1: Strict; tasks with RSS larger than LLC size are skipped.
  - 100: Aggressive; tasks are aggregated regardless of RSS.

For example, with a 32MB L3 cache:

  - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
  - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
    (784GB = (1 + (99 - 1) * 256) * 32MB).

Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance controls how
strictly the number of active threads is limited during cache-aware
load balancing. The SMT count is factored in: high SMT counts reduce
the aggregation capacity, preventing excessive task aggregation on
SMT-heavy systems such as Power10/Power11.

For example, with 8 Cores/16 CPUs in a L3:

  - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
  - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
    (785 = 1 + (99 - 1) * 8).

(3) llc_epoch_period/llc_epoch_affinity_timeout
In addition, llc_epoch_period and llc_epoch_affinity_timeout are made
tunable.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---

Notes:
    v1->v2: Remove the smt_nr check in fits_llc_capacity().
            (Aaron Lu)

 include/linux/sched.h   |  4 ++-
 kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h    |  5 ++++
 kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
 5 files changed, 178 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 466ba8b7398c..95bf080bbbf0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
 DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
 
 #ifdef CONFIG_SCHED_CACHE
+DECLARE_STATIC_KEY_FALSE(sched_cache_on);
+
 static inline bool sched_cache_enabled(void)
 {
-	return false;
+	return static_branch_unlikely(&sched_cache_on);
 }
 #endif
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 02e16b70a790..cde324672103 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
 	.release	= single_release,
 };
 
+#ifdef CONFIG_SCHED_CACHE
+#define SCHED_CACHE_CREATE_CONTROL(name, max)			  \
+static ssize_t sched_cache_write_##name(struct file *filp,	  \
+					const char __user *ubuf,  \
+					size_t cnt, loff_t *ppos) \
+{								  \
+	char buf[16];						  \
+	unsigned int val;					  \
+	if (cnt > 15)						  \
+		cnt = 15;					  \
+	if (copy_from_user(&buf, ubuf, cnt))			  \
+		return -EFAULT;					  \
+	buf[cnt] = '\0';					  \
+	if (kstrtouint(buf, 10, &val))				  \
+		return -EINVAL;					  \
+	if (val > (max))						  \
+		return -EINVAL;					  \
+	llc_##name = val;					  \
+	if (!strcmp(#name, "enabled"))				  \
+		sched_cache_set(false);				  \
+	*ppos += cnt;						  \
+	return cnt;						  \
+}								  \
+static int sched_cache_show_##name(struct seq_file *m, void *v)	  \
+{								  \
+	seq_printf(m, "%d\n", llc_##name);			  \
+	return 0;						  \
+}								  \
+static int sched_cache_open_##name(struct inode *inode,		  \
+				   struct file *filp)		  \
+{								  \
+	return single_open(filp, sched_cache_show_##name, NULL);  \
+}								  \
+static const struct file_operations sched_cache_fops_##name = {	  \
+	.open		= sched_cache_open_##name,		  \
+	.write		= sched_cache_write_##name,		  \
+	.read		= seq_read,				  \
+	.llseek		= seq_lseek,				  \
+	.release	= single_release,			  \
+}
+
+SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
+SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
+SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
+SCHED_CACHE_CREATE_CONTROL(enabled, 1);
+#endif /* SCHED_CACHE */
+
 static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
 				   size_t cnt, loff_t *ppos)
 {
@@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_CACHE
+	debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_overload_pct);
+	debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_imb_pct);
+	debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_aggr_tolerance);
+	debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
+			    &sched_cache_fops_enabled);
+	debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
+			   &llc_epoch_period);
+	debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
+			   &llc_epoch_affinity_timeout);
+#endif
+
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 424ec601cfdf..a2e2d6742481 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 
 __read_mostly unsigned int llc_overload_pct       = 50;
 __read_mostly unsigned int llc_imb_pct            = 20;
+__read_mostly unsigned int llc_aggr_tolerance     = 1;
+__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
 
 static int llc_id(int cpu)
 {
@@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
 	return llc;
 }
 
+static inline int get_sched_cache_scale(int mul)
+{
+	if (!llc_aggr_tolerance)
+		return 0;
+
+	if (llc_aggr_tolerance == 100)
+		return INT_MAX;
+
+	return (1 + (llc_aggr_tolerance - 1) * mul);
+}
+
 static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 {
+	unsigned int llc, scale;
 	struct cacheinfo *ci;
 	unsigned long rss;
-	unsigned int llc;
 
 	/*
 	 * get_cpu_cacheinfo_level() can not be used
@@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	rss = get_mm_counter(mm, MM_ANONPAGES) +
 		get_mm_counter(mm, MM_SHMEMPAGES);
 
-	return (llc <= (rss * PAGE_SIZE));
+	/*
+	 * Scale the LLC size by 256*llc_aggr_tolerance
+	 * and compare it to the task's RSS size.
+	 *
+	 * Suppose the L3 size is 32MB. If the
+	 * llc_aggr_tolerance is 1:
+	 * When the RSS is larger than 32MB, the process
+	 * is regarded as exceeding the LLC capacity. If
+	 * the llc_aggr_tolerance is 99:
+	 * When the RSS is larger than 784GB, the process
+	 * is regarded as exceeding the LLC capacity because:
+	 * 784GB = (1 + (99 - 1) * 256) * 32MB
+	 */
+	scale = get_sched_cache_scale(256);
+	if (scale == INT_MAX)
+		return false;
+
+	return ((llc * scale) <= (rss * PAGE_SIZE));
 }
 
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
-	int smt_nr = 1;
+	int smt_nr = 1, scale;
 
 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active())
 		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
 #endif
+	/*
+	 * Scale the Core number in a LLC by llc_aggr_tolerance
+	 * and compare it to the task's active threads.
+	 *
+	 * Suppose the number of Cores in LLC is 8.
+	 * Every core has 2 SMTs.
+	 * If the llc_aggr_tolerance is 1: When the
+	 * nr_running is larger than 8, the process
+	 * is regarded as exceeding the LLC capacity.
+	 * If the llc_aggr_tolerance is 99:
+	 * When the nr_running is larger than 785,
+	 * the process is regarded as exceeding
+	 * the LLC capacity:
+	 * 785 = 1 + (99 - 1) * 8
+	 */
+	scale = get_sched_cache_scale(1);
+	if (scale == INT_MAX)
+		return false;
 
-	return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
+	return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
 }
 
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
 	long delta = now - rq->cpu_epoch_next;
 
 	if (delta > 0) {
-		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		n = (delta + llc_epoch_period - 1) / llc_epoch_period;
 		rq->cpu_epoch += n;
-		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		rq->cpu_epoch_next += n * llc_epoch_period;
 		__shr_u64(&rq->cpu_runtime, n);
 	}
 
@@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * has only 1 thread, or has too many active threads, invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
 	    get_nr_threads(p) <= 1 ||
 	    exceed_llc_nr(mm, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 40798a06e058..15d126bd3728 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
 #ifdef CONFIG_SCHED_CACHE
 extern unsigned int llc_overload_pct;
 extern unsigned int llc_imb_pct;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_enabled;
+void sched_cache_set(bool locked);
 #endif
 
 #ifdef CONFIG_SCHED_HRTICK
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9799e3a9a609..818599ddaaef 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -26,6 +26,49 @@ int max_llcs;
 
 static bool sched_cache_present;
 
+unsigned int llc_enabled = 1;
+DEFINE_STATIC_KEY_FALSE(sched_cache_on);
+
+/*
+ * Enable/disable cache aware scheduling according to
+ * user input and the presence of hardware support.
+ */
+static void _sched_cache_set(bool enable, bool locked)
+{
+	if (enable) {
+		if (locked)
+			static_branch_enable_cpuslocked(&sched_cache_on);
+		else
+			static_branch_enable(&sched_cache_on);
+	} else {
+		if (locked)
+			static_branch_disable_cpuslocked(&sched_cache_on);
+		else
+			static_branch_disable(&sched_cache_on);
+	}
+}
+
+void sched_cache_set(bool locked)
+{
+	/* hardware does not support */
+	if (!sched_cache_present) {
+		if (static_branch_likely(&sched_cache_on))
+			_sched_cache_set(false, locked);
+
+		return;
+	}
+
+	/* user wants it or not ?*/
+	if (llc_enabled) {
+		if (!static_branch_likely(&sched_cache_on))
+			_sched_cache_set(true, locked);
+
+	} else {
+		if (static_branch_likely(&sched_cache_on))
+			_sched_cache_set(false, locked);
+	}
+}
+
 static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
 {
 	unsigned int *new = NULL;
@@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs)
 	 * new buffer.
 	 */
 	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
-	if (!tmp_llc_pref)
-		return -ENOMEM;
+	if (!tmp_llc_pref) {
+		sched_cache_present = false;
+		ret = -ENOMEM;
+
+		goto out;
+	}
 
 	for_each_present_cpu(i)
 		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
@@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs)
 		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
 		if (!new) {
 			ret = -ENOMEM;
+			sched_cache_present = false;
 
 			goto release_old;
 		}
@@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs)
 	if (!ret)
 		max_llcs = new_max_llcs;
 
+out:
+	sched_cache_set(true);
 	return ret;
 }
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (19 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-19  5:03   ` Yangyu Chen
  2025-12-03 23:07 ` [PATCH v2 22/23] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics Tim Chen
                   ` (2 subsequent siblings)
  23 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

Debug patch only.

With cache-aware load balancing enabled, statistics related to its activity
are exposed via /proc/schedstat and debugfs. For instance, to check how often
the RSS and nr_running limits are exceeded, users can filter the output of
/sys/kernel/debug/sched/debug and compute the required statistics manually:

llc_exceed_cap SUM: 6
llc_exceed_nr SUM: 4531

Furthermore, the statistics exposed in /proc/schedstat can be queried manually
or via perf sched stats[1] with minor modifications.

Link: https://lore.kernel.org/all/20250909114227.58802-1-swapnil.sapkal@amd.com #1

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/sched/topology.h | 1 +
 kernel/sched/fair.c            | 1 +
 kernel/sched/stats.c           | 5 +++--
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 0ba4697d74ba..8702c1e731a0 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -108,6 +108,7 @@ struct sched_domain {
 	unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_llc[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2e2d6742481..742e455b093e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12684,6 +12684,7 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
 		__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
 		break;
 	case migrate_llc_task:
+		__schedstat_add(sd->lb_imbalance_llc[idle], env->imbalance);
 		break;
 	}
 }
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index d1c9429a4ac5..3736f6102261 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -104,7 +104,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 17
+#define SCHEDSTAT_VERSION 18
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -139,7 +139,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 			seq_printf(seq, "domain%d %s %*pb", dcount++, sd->name,
 				   cpumask_pr_args(sched_domain_span(sd)));
 			for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
-				seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
+				seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u",
 				    sd->lb_count[itype],
 				    sd->lb_balanced[itype],
 				    sd->lb_failed[itype],
@@ -147,6 +147,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				    sd->lb_imbalance_util[itype],
 				    sd->lb_imbalance_task[itype],
 				    sd->lb_imbalance_misfit[itype],
+				    sd->lb_imbalance_llc[itype],
 				    sd->lb_gained[itype],
 				    sd->lb_hot_gained[itype],
 				    sd->lb_nobusyq[itype],
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 22/23] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (20 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-03 23:07 ` [PATCH v2 23/23] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Tim Chen
  2025-12-19  3:19 ` [PATCH v2 00/23] Cache aware scheduling Aaron Lu
  23 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

Debug patch only.

Users can leverage this trace event (via bpftrace, etc.) to monitor
cache-aware load balancing activity - whether tasks are moved onto or away
from their preferred LLC - for example by comparing the pref_llc and
attach_llc fields of sched_attach_task events.

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/trace/events/sched.h | 31 +++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 10 ++++++++++
 2 files changed, 41 insertions(+)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..bd03f49f7e3c 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -10,6 +10,37 @@
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
 
+TRACE_EVENT(sched_attach_task,
+
+	TP_PROTO(struct task_struct *t, int pref_cpu, int pref_llc,
+		 int attach_cpu, int attach_llc),
+
+	TP_ARGS(t, pref_cpu, pref_llc, attach_cpu, attach_llc),
+
+	TP_STRUCT__entry(
+			__array(	char,	comm,	TASK_COMM_LEN	)
+			__field(	pid_t,	pid			)
+			__field(	int,	pref_cpu		)
+			__field(	int,	pref_llc		)
+			__field(	int,	attach_cpu		)
+			__field(	int,	attach_llc		)
+	),
+
+	TP_fast_assign(
+		      memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		      __entry->pid	= t->pid;
+		      __entry->pref_cpu	= pref_cpu;
+		      __entry->pref_llc	= pref_llc;
+		      __entry->attach_cpu	= attach_cpu;
+		      __entry->attach_llc	= attach_llc;
+	),
+
+	TP_printk("comm=%s pid=%d pref_cpu=%d pref_llc=%d attach_cpu=%d attach_llc=%d",
+		  __entry->comm, __entry->pid,
+		  __entry->pref_cpu, __entry->pref_llc,
+		  __entry->attach_cpu, __entry->attach_llc)
+);
+
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 742e455b093e..e47b4096f0a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10487,6 +10487,16 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 {
 	lockdep_assert_rq_held(rq);
 
+#ifdef CONFIG_SCHED_CACHE
+	if (p->mm) {
+		int pref_cpu = p->mm->mm_sched_cpu;
+
+		trace_sched_attach_task(p,
+					pref_cpu,
+					pref_cpu != -1 ? llc_id(pref_cpu) : -1,
+					cpu_of(rq), llc_id(cpu_of(rq)));
+	}
+#endif
 	WARN_ON_ONCE(task_rq(p) != rq);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	wakeup_preempt(rq, p, 0);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* [PATCH v2 23/23] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (21 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 22/23] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics Tim Chen
@ 2025-12-03 23:07 ` Tim Chen
  2025-12-17  9:59   ` Aaron Lu
  2025-12-19  3:19 ` [PATCH v2 00/23] Cache aware scheduling Aaron Lu
  23 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-03 23:07 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
	Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

From: Chen Yu <yu.c.chen@intel.com>

Debug patch only.

Show the per-LLC occupancy in /proc/{PID}/schedstat, with each column
corresponding to one LLC. This can be used to verify if the cache-aware
load balancer works as expected by aggregating threads onto dedicated LLCs.

Suppose there are 2 LLCs and the sampling duration is 10 seconds:

With cache-aware load balancing enabled:
0 12281  <--- LLC0 residency delta is 0, LLC1 is 12 seconds
0 18881
0 16217

With cache-aware load balancing disabled:
6497 15802
9299 5435
17811 8278

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 fs/proc/base.c           | 22 ++++++++++++++++++++++
 include/linux/mm_types.h | 19 +++++++++++++++++--
 include/linux/sched.h    |  3 +++
 kernel/sched/fair.c      | 40 ++++++++++++++++++++++++++++++++++++++--
 4 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6299878e3d97..f4be96f4bd01 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -518,6 +518,28 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
 		   (unsigned long long)task->se.sum_exec_runtime,
 		   (unsigned long long)task->sched_info.run_delay,
 		   task->sched_info.pcount);
+#ifdef CONFIG_SCHED_CACHE
+	if (sched_cache_enabled()) {
+		struct mm_struct *mm = task->mm;
+		u64 *llc_runtime;
+
+		if (!mm)
+			return 0;
+
+		llc_runtime = kcalloc(max_llcs, sizeof(u64), GFP_KERNEL);
+		if (!llc_runtime)
+			return 0;
+
+		if (get_mm_per_llc_runtime(task, llc_runtime))
+			goto out;
+
+		for (int i = 0; i < max_llcs; i++)
+			seq_printf(m, "%llu ", llc_runtime[i]);
+		seq_puts(m, "\n");
+out:
+		kfree(llc_runtime);
+	}
+#endif
 
 	return 0;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 04743983de4d..255c22be7312 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -944,6 +944,10 @@ struct mm_sched {
 	unsigned long epoch;
 };
 
+struct mm_time {
+	u64 runtime_ns;
+};
+
 struct kioctx_table;
 struct iommu_mm_data;
 struct mm_struct {
@@ -1040,6 +1044,7 @@ struct mm_struct {
 		 * See account_mm_sched() and ...
 		 */
 		struct mm_sched __percpu *pcpu_sched;
+		struct mm_time __percpu *pcpu_time;
 		raw_spinlock_t mm_sched_lock;
 		unsigned long mm_sched_epoch;
 		int mm_sched_cpu;
@@ -1505,16 +1510,24 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 #endif /* CONFIG_SCHED_MM_CID */
 
 #ifdef CONFIG_SCHED_CACHE
-void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched,
+		   struct mm_time __percpu *pcpu_time);
 
 static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 {
 	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+	struct mm_time __percpu *pcpu_time;
 
 	if (!pcpu_sched)
 		return -ENOMEM;
 
-	mm_init_sched(mm, pcpu_sched);
+	pcpu_time = alloc_percpu_noprof(struct mm_time);
+	if (!pcpu_time) {
+		free_percpu(mm->pcpu_sched);
+		return -ENOMEM;
+	}
+
+	mm_init_sched(mm, pcpu_sched, pcpu_time);
 	return 0;
 }
 
@@ -1523,7 +1536,9 @@ static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 static inline void mm_destroy_sched(struct mm_struct *mm)
 {
 	free_percpu(mm->pcpu_sched);
+	free_percpu(mm->pcpu_time);
 	mm->pcpu_sched = NULL;
+	mm->pcpu_time = NULL;
 }
 #else /* !CONFIG_SCHED_CACHE */
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 95bf080bbbf0..875ac3f4208b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2442,6 +2442,9 @@ static inline bool sched_cache_enabled(void)
 {
 	return static_branch_unlikely(&sched_cache_on);
 }
+
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf);
+extern int max_llcs;
 #endif
 
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e47b4096f0a6..205208f061bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1355,16 +1355,19 @@ static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
 	p->sched_llc_active = false;
 }
 
-void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched,
+		   struct mm_time __percpu *_pcpu_time)
 {
 	unsigned long epoch;
 	int i;
 
 	for_each_possible_cpu(i) {
 		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct mm_time *pcpu_time = per_cpu_ptr(_pcpu_time, i);
 		struct rq *rq = cpu_rq(i);
 
 		pcpu_sched->runtime = 0;
+		pcpu_time->runtime_ns = 0;
 		pcpu_sched->epoch = rq->cpu_epoch;
 		epoch = rq->cpu_epoch;
 	}
@@ -1379,6 +1382,8 @@ void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 	 * the readers may get invalid mm_sched_epoch, etc.
 	 */
 	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
+	/* same as above */
+	smp_store_release(&mm->pcpu_time, _pcpu_time);
 }
 
 /* because why would C be fully specified */
@@ -1428,11 +1433,39 @@ static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sch
 
 static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
 
+/* p->pi_lock is hold */
+int get_mm_per_llc_runtime(struct task_struct *p, u64 *buf)
+{
+	struct mm_struct *mm = p->mm;
+	struct mm_time *pcpu_time;
+	int cpu;
+
+	if (!mm)
+		return -EINVAL;
+
+	rcu_read_lock();
+	for_each_online_cpu(cpu) {
+		int llc = llc_id(cpu);
+		u64 runtime_ms;
+
+		if (llc < 0)
+			continue;
+
+		pcpu_time = per_cpu_ptr(mm->pcpu_time, cpu);
+		runtime_ms = div_u64(pcpu_time->runtime_ns, NSEC_PER_MSEC);
+		buf[llc] += runtime_ms;
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+
 static inline
 void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
 	struct mm_struct *mm = p->mm;
 	struct mm_sched *pcpu_sched;
+	struct mm_time *pcpu_time;
 	unsigned long epoch;
 	int mm_sched_llc = -1;
 
@@ -1444,14 +1477,17 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	/*
 	 * init_task and kthreads don't having mm
 	 */
-	if (!mm || !mm->pcpu_sched)
+	if (!mm || !mm->pcpu_sched || !mm->pcpu_time)
 		return;
 
 	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
+	pcpu_time = per_cpu_ptr(p->mm->pcpu_time, cpu_of(rq));
 
 	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
 		__update_mm_sched(rq, pcpu_sched);
 		pcpu_sched->runtime += delta_exec;
+		/* pure runtime without decay */
+		pcpu_time->runtime_ns += delta_exec;
 		rq->cpu_runtime += delta_exec;
 		epoch = rq->cpu_epoch;
 	}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
  2025-12-03 23:07 ` [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
@ 2025-12-09 11:12   ` Peter Zijlstra
  2025-12-09 21:39     ` Tim Chen
  2025-12-10  9:37   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-09 11:12 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:20PM -0800, Tim Chen wrote:

>        Minor fix in task_tick_cache() to use
>        if (mm->mm_sched_epoch >= rq->cpu_epoch)
>        to avoid mm_sched_epoch going backwards.

> +static void task_tick_cache(struct rq *rq, struct task_struct *p)
> +{
> +	struct callback_head *work = &p->cache_work;
> +	struct mm_struct *mm = p->mm;
> +
> +	if (!sched_cache_enabled())
> +		return;
> +
> +	if (!mm || !mm->pcpu_sched)
> +		return;
> +
> +	/* avoid moving backwards */
> +	if (mm->mm_sched_epoch >= rq->cpu_epoch)
> +		return;

IIRC this was supposed to be able to wrap; which then means you should
write it like:

	if ((mm->mm_sched_epoch - rq->cpu_epoch) >= 0)
		return;

or somesuch.
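
A sketch of the wrap-safe form this suggests, assuming both epoch values
are unsigned long (a plain unsigned difference compared against zero would
always be non-negative):

	if ((long)(mm->mm_sched_epoch - rq->cpu_epoch) >= 0)
		return;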

> +
> +	guard(raw_spinlock)(&mm->mm_sched_lock);
> +
> +	if (work->next == work) {
> +		task_work_add(p, work, TWA_RESUME);
> +		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
> +	}
> +}

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-12-03 23:07 ` [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions Tim Chen
@ 2025-12-09 11:21   ` Peter Zijlstra
  2025-12-10 14:02     ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-09 11:21 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:21PM -0800, Tim Chen wrote:

> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index bbcfdf12aa6e..0ba4697d74ba 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -68,6 +68,10 @@ struct sched_domain_shared {
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
>  	int		nr_idle_scan;
> +#ifdef CONFIG_SCHED_CACHE
> +	unsigned long	util_avg;
> +	unsigned long	capacity ____cacheline_aligned_in_smp;

This cacheline annotation confuses me, see below.

> +#endif
>  };
>  
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cb82f558dc5b..b9f336300f14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9622,6 +9622,29 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
>  	return 0;
>  }
>  
> +#ifdef CONFIG_SCHED_CACHE
> +/* Called from load balancing paths with rcu_read_lock held */
> +static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
> +					 unsigned long *cap)
> +{
> +	struct sched_domain_shared *sd_share;
> +
> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +	if (!sd_share)
> +		return false;
> +
> +	*util = READ_ONCE(sd_share->util_avg);
> +	*cap = READ_ONCE(sd_share->capacity);

You placed capacity on a separate line, forcing the above to be 2
distinct lines. That seems... sub-optimal?

> +
> +	return true;
> +}
> +#else
> +static inline bool get_llc_stats(int cpu, unsigned long *util,
> +				 unsigned long *cap)
> +{
> +	return false;
> +}
> +#endif
>  /*
>   * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
>   */
> @@ -10592,6 +10615,51 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>  	return check_cpu_capacity(rq, sd);
>  }
>  
> +#ifdef CONFIG_SCHED_CACHE
> +/*
> + * Record the statistics for this scheduler group for later
> + * use. These values guide load balancing on aggregating tasks
> + * to a LLC.
> + */
> +static void record_sg_llc_stats(struct lb_env *env,
> +				struct sg_lb_stats *sgs,
> +				struct sched_group *group)
> +{
> +	struct sched_domain_shared *sd_share;
> +
> +	if (!sched_cache_enabled() || env->idle == CPU_NEWLY_IDLE)
> +		return;
> +
> +	/* Only care about sched domain spanning multiple LLCs */
> +	if (env->sd->child != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
> +		return;
> +
> +	/*
> +	 * At this point we know this group spans a LLC domain.
> +	 * Record the statistic of this group in its corresponding
> +	 * shared LLC domain.
> +	 * Note: sd_share cannot be obtained via sd->child->shared, because
> +	 * it refers to the domain that covers the local group, while
> +	 * sd_share could represent any of the LLC group.
> +	 */
> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
> +					   cpumask_first(sched_group_span(group))));
> +	if (!sd_share)
> +		return;
> +
> +	if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
> +		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
> +
> +	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
> +		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);

And same here.

> +}
> +#else
> +static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
> +				       struct sched_group *group)
> +{
> +}
> +#endif

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-03 23:07 ` [PATCH v2 04/23] sched/cache: Make LLC id continuous Tim Chen
@ 2025-12-09 11:58   ` Peter Zijlstra
  2025-12-15 20:49     ` Tim Chen
  2025-12-23  5:31   ` K Prateek Nayak
  1 sibling, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-09 11:58 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:23PM -0800, Tim Chen wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 710ed9943d27..0a3918269906 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1210,10 +1210,17 @@ __read_mostly unsigned int llc_imb_pct            = 20;
>  
>  static int llc_id(int cpu)
>  {
> +	int llc;
> +
>  	if (cpu < 0)
>  		return -1;
>  
> +	llc = per_cpu(sd_llc_id, cpu);
> +	/* avoid race with cpu hotplug */
> +	if (unlikely(llc >= max_llcs))
> +		return -1;
> +
> +	return llc;
>  }
>  
>  void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)

> @@ -668,6 +670,55 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>  DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
>  DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
>  
> +/*
> + * Assign continuous llc id for the CPU, and return
> + * the assigned llc id.
> + */
> +static int update_llc_id(struct sched_domain *sd,
> +			 int cpu)
> +{
> +	int id = per_cpu(sd_llc_id, cpu), i;
> +
> +	if (id >= 0)
> +		return id;
> +
> +	if (sd) {
> +		/* Look for any assigned id and reuse it.*/
> +		for_each_cpu(i, sched_domain_span(sd)) {
> +			id = per_cpu(sd_llc_id, i);
> +
> +			if (id >= 0) {
> +				per_cpu(sd_llc_id, cpu) = id;
> +				return id;
> +			}
> +		}
> +	}
> +
> +	/*
> +	 * When 1. there is no id assigned to this LLC domain,
> +	 * or 2. the sd is NULL, we reach here.
> +	 * Consider the following scenario,
> +	 * CPU0~CPU95 are in the node0, CPU96~CPU191 are
> +	 * in the node1. During bootup, maxcpus=96 is
> +	 * appended.
> +	 * case 1: When running cpu_attach_domain(CPU24)
> +	 * during boot up, CPU24 is the first CPU in its
> +	 * non-NULL LLC domain. However,
> +	 * its corresponding llc id has not been assigned yet.
> +	 *
> +	 * case 2: After boot up, the CPU100 is brought up
> +	 * via sysfs manually. As a result, CPU100 has only a
> +	 * Numa domain attached, because CPU100 is the only CPU
> +	 * of a sched domain, all its bottom domains are degenerated.
> +	 * The LLC domain pointer sd is NULL for CPU100.
> +	 *
> +	 * For both cases, we want to increase the number of LLCs.
> +	 */
> +	per_cpu(sd_llc_id, cpu) = max_llcs++;
> +
> +	return per_cpu(sd_llc_id, cpu);
> +}

I'm not sure I follow. So partition_sched_domains() first calls
detach_destroy_domains() on the old set, and then build_sched_domains()
on the new set.

So detach_destroy_domain() will do:

  cpu_attach_domain(NULL,..);

That is, it will explicitly attach the NULL sched_domain to a CPU. At
which point I feel update_llc_id() should be returning -1, no?

Then later, build_sched_domains() will set a !NULL sched_domain, at
which point update_llc_id() can set a real value.

This should then also get rid of that weird max_llcs check in llc_id(),
right?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes
  2025-12-03 23:07 ` [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes Tim Chen
@ 2025-12-09 12:11   ` Peter Zijlstra
  2025-12-09 22:34     ` Tim Chen
  2025-12-12  3:34   ` Vern Hao
  1 sibling, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-09 12:11 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:24PM -0800, Tim Chen wrote:
> With cache-aware scheduling enabled, each task is assigned a
> preferred LLC ID. This allows quick identification of the LLC domain
> where the task prefers to run, similar to numa_preferred_nid in
> NUMA balancing.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a3918269906..10cec83f65d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1300,6 +1300,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>  	struct mm_struct *mm = p->mm;
>  	struct mm_sched *pcpu_sched;
>  	unsigned long epoch;
> +	int mm_sched_llc = -1;
>  
>  	if (!sched_cache_enabled())
>  		return;
> @@ -1330,6 +1331,23 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>  		if (mm->mm_sched_cpu != -1)
>  			mm->mm_sched_cpu = -1;
>  	}
> +
> +	if (mm->mm_sched_cpu != -1) {
> +		mm_sched_llc = llc_id(mm->mm_sched_cpu);
> +
> +#ifdef CONFIG_NUMA_BALANCING
> +		/*
> +		 * Don't assign preferred LLC if it
> +		 * conflicts with NUMA balancing.
> +		 */
> +		if (p->numa_preferred_nid >= 0 &&
> +		    cpu_to_node(mm->mm_sched_cpu) != p->numa_preferred_nid)
> +			mm_sched_llc = -1;
> +#endif
> +	}
> +
> +	if (p->preferred_llc != mm_sched_llc)
> +		p->preferred_llc = mm_sched_llc;
>  }

This can of course still happen when sched_setnuma() gets called. I'm
thinking it is not much of an issue because we expect this thing to get
called fairly regularly -- at a higher rate than sched_setnuma() at
least -- and thus the conflict only exists for a short period of time?

If so, that would make for a good comment.

Additionally, we could of course search for the busiest LLC inside the
node, instead of setting -1. Again, that could live as a comment for
future work.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue
  2025-12-03 23:07 ` [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
@ 2025-12-09 12:16   ` Peter Zijlstra
  2025-12-09 22:55     ` Tim Chen
  2025-12-17 10:04   ` Vern Hao
  1 sibling, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-09 12:16 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:25PM -0800, Tim Chen wrote:


>  #ifdef CONFIG_SCHED_CACHE
>  	struct callback_head		cache_work;
> +	/*the p is currently refcounted in a rq's preferred llc stats*/

Shall we have spaces after and before the comment marks?

Also, comment confuses me, I don't see get_task_struct() /
put_task_struct() usage. Did you mean something else with refcount?

> +	bool				sched_llc_active;
>  	int				preferred_llc;
>  #endif

> +static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
> +{
> +	int pref_llc;
> +
> +	/*
> +	 * Borrow the uc_se->active from uclamp_rq_inc_id(),
> +	 * uclamp_rq_dec_id() to avoid the unbalanced calculation
> +	 * of rq statistics.
> +	 */
> +	if (unlikely(!p->sched_llc_active))
> +		return;

Another very confusing comment; what? Also, can you please explain (in
the new comment) how we get here without having llc_active set?

> +
> +	pref_llc = p->preferred_llc;
> +	if (pref_llc < 0)
> +		return;
> +
> +	rq->nr_llc_running--;
> +	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
> +	p->sched_llc_active = false;
> +}

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-03 23:07 ` [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter Tim Chen
@ 2025-12-09 13:06   ` Peter Zijlstra
  2025-12-09 23:17     ` Tim Chen
  2025-12-10 12:43   ` Peter Zijlstra
  2025-12-10 12:51   ` Peter Zijlstra
  2 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-09 13:06 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:

> +#ifdef CONFIG_SCHED_CACHE
> +
> +static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
> +{
> +	unsigned int *new = NULL;
> +
> +	new = kcalloc(new_max_llcs, sizeof(unsigned int),
> +		      GFP_KERNEL | __GFP_NOWARN);
> +
> +	if (!new) {
> +		*gc = NULL;
> +	} else {
> +		/*
> +		 * Place old entry in garbage collector
> +		 * for later disposal.
> +		 */
> +		*gc = old;
> +	}
> +	return new;
> +}
> +
> +static void populate_new_pref_llcs(unsigned int *old, unsigned int *new)
> +{
> +	int i;
> +
> +	if (!old)
> +		return;
> +
> +	for (i = 0; i < max_llcs; i++)
> +		new[i] = old[i];
> +}
> +
> +static int resize_llc_pref(void)
> +{
> +	unsigned int *__percpu *tmp_llc_pref;
> +	int i, ret = 0;
> +
> +	if (new_max_llcs <= max_llcs)
> +		return 0;
> +
> +	/*
> +	 * Allocate temp percpu pointer for old llc_pref,
> +	 * which will be released after switching to the
> +	 * new buffer.
> +	 */
> +	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
> +	if (!tmp_llc_pref)
> +		return -ENOMEM;
> +
> +	for_each_present_cpu(i)
> +		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
> +
> +	/*
> +	 * Resize the per rq nr_pref_llc buffer and
> +	 * switch to this new buffer.
> +	 */
> +	for_each_present_cpu(i) {
> +		struct rq_flags rf;
> +		unsigned int *new;
> +		struct rq *rq;
> +
> +		rq = cpu_rq(i);
> +		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
> +		if (!new) {
> +			ret = -ENOMEM;
> +
> +			goto release_old;
> +		}
> +
> +		/*
> +		 * Locking rq ensures that rq->nr_pref_llc values
> +		 * don't change with new task enqueue/dequeue
> +		 * when we repopulate the newly enlarged array.
> +		 */

		guard(rq_lock_irq)(rq);

Notably, this cannot be with IRQs disabled, as you're doing allocations.

> +		rq_lock_irqsave(rq, &rf);
> +		populate_new_pref_llcs(rq->nr_pref_llc, new);
> +		rq->nr_pref_llc = new;
> +		rq_unlock_irqrestore(rq, &rf);
> +	}
> +
> +release_old:
> +	/*
> +	 * Load balance is done under rcu_lock.
> +	 * Wait for load balance before and during resizing to
> +	 * be done. They may refer to old nr_pref_llc[]
> +	 * that hasn't been resized.
> +	 */
> +	synchronize_rcu();
> +	for_each_present_cpu(i)
> +		kfree(*per_cpu_ptr(tmp_llc_pref, i));
> +
> +	free_percpu(tmp_llc_pref);
> +
> +	/* succeed and update */
> +	if (!ret)
> +		max_llcs = new_max_llcs;
> +
> +	return ret;
> +}

I think you need at least cpus_read_lock(), because present_cpu is
dynamic -- but I'm not quite sure what lock is used to serialize it.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
  2025-12-09 11:12   ` Peter Zijlstra
@ 2025-12-09 21:39     ` Tim Chen
  0 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-09 21:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Tue, 2025-12-09 at 12:12 +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:20PM -0800, Tim Chen wrote:
> 
> >        Minor fix in task_tick_cache() to use
> >        if (mm->mm_sched_epoch >= rq->cpu_epoch)
> >        to avoid mm_sched_epoch going backwards.
> 
> > +static void task_tick_cache(struct rq *rq, struct task_struct *p)
> > +{
> > +	struct callback_head *work = &p->cache_work;
> > +	struct mm_struct *mm = p->mm;
> > +
> > +	if (!sched_cache_enabled())
> > +		return;
> > +
> > +	if (!mm || !mm->pcpu_sched)
> > +		return;
> > +
> > +	/* avoid moving backwards */
> > +	if (mm->mm_sched_epoch >= rq->cpu_epoch)
> > +		return;
> 
> IIRC this was supposed to be able to wrap; which then means you should
> write it like:
> 
> 	if ((mm->mm_sched_epoch - rq->cpu_epoch) >= 0)
> 		return;
> 
> or somesuch.

Okay. Got it.
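
i.e. with a signed cast so the unsigned subtraction is allowed to go
negative, along the lines of the usual wrap-safe comparison (sketch,
assuming both epochs stay unsigned long):

	/* wrap-safe version of mm_sched_epoch >= cpu_epoch */
	if ((long)(mm->mm_sched_epoch - rq->cpu_epoch) >= 0)
		return;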

Tim

> 
> > +
> > +	guard(raw_spinlock)(&mm->mm_sched_lock);
> > +
> > +	if (work->next == work) {
> > +		task_work_add(p, work, TWA_RESUME);
> > +		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
> > +	}
> > +}

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes
  2025-12-09 12:11   ` Peter Zijlstra
@ 2025-12-09 22:34     ` Tim Chen
  0 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-09 22:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Tue, 2025-12-09 at 13:11 +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:24PM -0800, Tim Chen wrote:
> > With cache-aware scheduling enabled, each task is assigned a
> > preferred LLC ID. This allows quick identification of the LLC domain
> > where the task prefers to run, similar to numa_preferred_nid in
> > NUMA balancing.
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0a3918269906..10cec83f65d5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1300,6 +1300,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> >  	struct mm_struct *mm = p->mm;
> >  	struct mm_sched *pcpu_sched;
> >  	unsigned long epoch;
> > +	int mm_sched_llc = -1;
> >  
> >  	if (!sched_cache_enabled())
> >  		return;
> > @@ -1330,6 +1331,23 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> >  		if (mm->mm_sched_cpu != -1)
> >  			mm->mm_sched_cpu = -1;
> >  	}
> > +
> > +	if (mm->mm_sched_cpu != -1) {
> > +		mm_sched_llc = llc_id(mm->mm_sched_cpu);
> > +
> > +#ifdef CONFIG_NUMA_BALANCING
> > +		/*
> > +		 * Don't assign preferred LLC if it
> > +		 * conflicts with NUMA balancing.
> > +		 */
> > +		if (p->numa_preferred_nid >= 0 &&
> > +		    cpu_to_node(mm->mm_sched_cpu) != p->numa_preferred_nid)
> > +			mm_sched_llc = -1;
> > +#endif
> > +	}
> > +
> > +	if (p->preferred_llc != mm_sched_llc)
> > +		p->preferred_llc = mm_sched_llc;
> >  }
> 
> This can of course still happen when sched_setnuma() gets called. I'm
> thinking it is not much of an issue because we expect this thing to get
> called fairly regularly -- at a higher rate than sched_setnuma() at
> least -- and thus the conflict only exists for a short period of time?
> 
> If so, that would make for a good comment.

Sure.  Will do.

> 
> Additionally, we could of course search for the busiest LLC inside the
> node, instead of setting -1. Again, that could live as a comment for
> future work.

A potential issue with scanning only the preferred node of a single task 
is that tasks within the same process may have different preferred nodes.
For example, task 1 may prefer one node, while tasks 2…n prefer another. 
If we base the busiest-LLC scan solely on task 1’s preference, we may 
ignore the preferences of tasks 2…n. Consequently, constraining 
the preferred LLC according to task 1’s node can interfere with 
NUMA balancing for the rest of the process. This problem does not 
arise when all tasks being aggregated belong to the same numa_group, 
since they will share the same preferred node.

Tim


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue
  2025-12-09 12:16   ` Peter Zijlstra
@ 2025-12-09 22:55     ` Tim Chen
  2025-12-10  9:42       ` Peter Zijlstra
  0 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-09 22:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Tue, 2025-12-09 at 13:16 +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:25PM -0800, Tim Chen wrote:
> 
> 
> >  #ifdef CONFIG_SCHED_CACHE
> >  	struct callback_head		cache_work;
> > +	/*the p is currently refcounted in a rq's preferred llc stats*/
> 
> Shall we have spaces after and before the comment marks?
> 
> Also, comment confuses me, I don't see get_task_struct() /
> put_task_struct() usage. Did you mean something else with refcount?

It is the accounting of the number of tasks preferring a certain LLC
on a runqueue.  It is updated during enqueue/dequeue, or when a task's
LLC preference changes, by account_llc_enqueue() and
account_llc_dequeue().

How about changing the comment to

	/* LLC preference accounting should be done in dequeue */
> 
> > +	bool				sched_llc_active;
> >  	int				preferred_llc;
> >  #endif
> 
> > +static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
> > +{
> > +	int pref_llc;
> > +
> > +	/*
> > +	 * Borrow the uc_se->active from uclamp_rq_inc_id(),
> > +	 * uclamp_rq_dec_id() to avoid the unbalanced calculation
> > +	 * of rq statistics.
> > +	 */
> > +	if (unlikely(!p->sched_llc_active))
> > +		return;
> 
> Another very confusing comment; what? Also, can you please explain (in
> the new comment) how we get here without having llc_active set?

The comment meant to say that we are using a similar mechanism as
accounting done in uc_se->active from uclamp_rq_inc_id(). I agree that
it confuses more than making things clearer.

How about the following comment to make things clearer:

	/*
	 * Cache aware scheduling was active when the task was enqueued.
	 * Admin has disabled cache aware scheduling before task was dequeued
	 * but the accounting has to be kept straight in case cache aware scheduling
	 * is re-enabled.
	 */
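
For reference, the idea on the enqueue side is roughly the following
(simplified sketch, not the literal patch 6 code):

	static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
	{
		/* feature off: don't account, dequeue bails on !sched_llc_active */
		if (!sched_cache_enabled())
			return;

		if (p->preferred_llc < 0)
			return;

		rq->nr_llc_running++;
		rq->nr_pref_llc_running += (p->preferred_llc == task_llc(p));
		/* this task is now counted in the rq's LLC stats */
		p->sched_llc_active = true;
	}

so the dequeue side only needs to undo the accounting when the enqueue
side actually did it.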

> 
> > +
> > +	pref_llc = p->preferred_llc;
> > +	if (pref_llc < 0)
> > +		return;
> > +
> > +	rq->nr_llc_running--;
> > +	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
> > +	p->sched_llc_active = false;
> > +}

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-09 13:06   ` Peter Zijlstra
@ 2025-12-09 23:17     ` Tim Chen
  0 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-09 23:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Tue, 2025-12-09 at 14:06 +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:
> 
> > +#ifdef CONFIG_SCHED_CACHE
> > +
> > +static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
> > +{
> > +	unsigned int *new = NULL;
> > +
> > +	new = kcalloc(new_max_llcs, sizeof(unsigned int),
> > +		      GFP_KERNEL | __GFP_NOWARN);
> > +
> > +	if (!new) {
> > +		*gc = NULL;
> > +	} else {
> > +		/*
> > +		 * Place old entry in garbage collector
> > +		 * for later disposal.
> > +		 */
> > +		*gc = old;
> > +	}
> > +	return new;
> > +}
> > +
> > +static void populate_new_pref_llcs(unsigned int *old, unsigned int *new)
> > +{
> > +	int i;
> > +
> > +	if (!old)
> > +		return;
> > +
> > +	for (i = 0; i < max_llcs; i++)
> > +		new[i] = old[i];
> > +}
> > +
> > +static int resize_llc_pref(void)
> > +{
> > +	unsigned int *__percpu *tmp_llc_pref;
> > +	int i, ret = 0;
> > +
> > +	if (new_max_llcs <= max_llcs)
> > +		return 0;
> > +
> > +	/*
> > +	 * Allocate temp percpu pointer for old llc_pref,
> > +	 * which will be released after switching to the
> > +	 * new buffer.
> > +	 */
> > +	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
> > +	if (!tmp_llc_pref)
> > +		return -ENOMEM;
> > +
> > +	for_each_present_cpu(i)
> > +		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
> > +
> > +	/*
> > +	 * Resize the per rq nr_pref_llc buffer and
> > +	 * switch to this new buffer.
> > +	 */
> > +	for_each_present_cpu(i) {
> > +		struct rq_flags rf;
> > +		unsigned int *new;
> > +		struct rq *rq;
> > +
> > +		rq = cpu_rq(i);
> > +		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
> > +		if (!new) {
> > +			ret = -ENOMEM;
> > +
> > +			goto release_old;
> > +		}
> > +
> > +		/*
> > +		 * Locking rq ensures that rq->nr_pref_llc values
> > +		 * don't change with new task enqueue/dequeue
> > +		 * when we repopulate the newly enlarged array.
> > +		 */
> 
> 		guard(rq_lock_irq)(rq);
> 
> Notably, this cannot be with IRQs disabled, as you're doing allocations.

Okay.

> 
> > +		rq_lock_irqsave(rq, &rf);
> > +		populate_new_pref_llcs(rq->nr_pref_llc, new);
> > +		rq->nr_pref_llc = new;
> > +		rq_unlock_irqrestore(rq, &rf);
> > +	}
> > +
> > +release_old:
> > +	/*
> > +	 * Load balance is done under rcu_lock.
> > +	 * Wait for load balance before and during resizing to
> > +	 * be done. They may refer to old nr_pref_llc[]
> > +	 * that hasn't been resized.
> > +	 */
> > +	synchronize_rcu();
> > +	for_each_present_cpu(i)
> > +		kfree(*per_cpu_ptr(tmp_llc_pref, i));
> > +
> > +	free_percpu(tmp_llc_pref);
> > +
> > +	/* succeed and update */
> > +	if (!ret)
> > +		max_llcs = new_max_llcs;
> > +
> > +	return ret;
> > +}
> 
> I think you need at least cpus_read_lock(), because present_cpu is
> dynamic -- but I'm not quite sure what lock is used to serialize it.

Let me check what the right lock is for making sure present_cpu
is not changed.  Thanks.
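
If it turns out the callers of build_sched_domains() already serialize
against CPU hotplug, maybe making that assumption explicit is enough,
something like (untested sketch):

	static int resize_llc_pref(void)
	{
		/*
		 * Walking the present mask and the per-rq buffers relies
		 * on the caller excluding CPU hotplug.
		 */
		lockdep_assert_cpus_held();

		/* ... existing body unchanged ... */
	}

Otherwise an explicit cpus_read_lock()/cpus_read_unlock() pair around
the walk would be needed.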

Tim

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
  2025-12-03 23:07 ` [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
  2025-12-09 11:12   ` Peter Zijlstra
@ 2025-12-10  9:37   ` Peter Zijlstra
  2025-12-10 13:57     ` Chen, Yu C
  2025-12-11  9:03   ` Vern Hao
       [not found]   ` <fbf52d91-0605-4608-b9cc-e8cc56115fd5@gmail.com>
  3 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10  9:37 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:20PM -0800, Tim Chen wrote:

> +static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)

> +static void __no_profile task_cache_work(struct callback_head *work)

What's with the random __no_profile things?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue
  2025-12-09 22:55     ` Tim Chen
@ 2025-12-10  9:42       ` Peter Zijlstra
  2025-12-16  0:20         ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10  9:42 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Tue, Dec 09, 2025 at 02:55:21PM -0800, Tim Chen wrote:

> > > +static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
> > > +{
> > > +	int pref_llc;
> > > +
> > > +	/*
> > > +	 * Borrow the uc_se->active from uclamp_rq_inc_id(),
> > > +	 * uclamp_rq_dec_id() to avoid the unbalanced calculation
> > > +	 * of rq statistics.
> > > +	 */
> > > +	if (unlikely(!p->sched_llc_active))
> > > +		return;
> > 
> > Another very confusing comment; what? Also, can you please explain (in
> > the new comment) how we get here without having llc_active set?
> 
> The comment meant to say that we are using a similar mechanism as
> accounting done in uc_se->active from uclamp_rq_inc_id(). I agree that
> it confuses more than making things clearer.
> 
> How about the following comment to make things clearer:
> 
> 	/*
> 	 * Cache aware scheduling was active when the task was enqueued.
> 	 * Admin has disabled cache aware scheduling before task was dequeued
> 	 * but the accounting has to be kept straight in case cache aware scheduling
> 	 * is re-enabled.
> 	 */

Is having that sched_cache_enabled() test worth it?
account_numa_{en,de}queue() don't seem to have any of this.


> > > +	pref_llc = p->preferred_llc;
> > > +	if (pref_llc < 0)
> > > +		return;
> > > +
> > > +	rq->nr_llc_running--;
> > > +	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
> > > +	p->sched_llc_active = false;
> > > +}

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-03 23:07 ` [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter Tim Chen
  2025-12-09 13:06   ` Peter Zijlstra
@ 2025-12-10 12:43   ` Peter Zijlstra
  2025-12-10 18:36     ` Tim Chen
  2025-12-10 12:51   ` Peter Zijlstra
  2 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 12:43 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:

> +static int resize_llc_pref(void)
> +{
> +	unsigned int *__percpu *tmp_llc_pref;
> +	int i, ret = 0;
> +
> +	if (new_max_llcs <= max_llcs)
> +		return 0;
> +
> +	/*
> +	 * Allocate temp percpu pointer for old llc_pref,
> +	 * which will be released after switching to the
> +	 * new buffer.
> +	 */
> +	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
> +	if (!tmp_llc_pref)
> +		return -ENOMEM;
> +
> +	for_each_present_cpu(i)
> +		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
> +
> +	/*
> +	 * Resize the per rq nr_pref_llc buffer and
> +	 * switch to this new buffer.
> +	 */
> +	for_each_present_cpu(i) {
> +		struct rq_flags rf;
> +		unsigned int *new;
> +		struct rq *rq;
> +
> +		rq = cpu_rq(i);
> +		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
> +		if (!new) {
> +			ret = -ENOMEM;
> +
> +			goto release_old;
> +		}
> +
> +		/*
> +		 * Locking rq ensures that rq->nr_pref_llc values
> +		 * don't change with new task enqueue/dequeue
> +		 * when we repopulate the newly enlarged array.
> +		 */
> +		rq_lock_irqsave(rq, &rf);
> +		populate_new_pref_llcs(rq->nr_pref_llc, new);
> +		rq->nr_pref_llc = new;
> +		rq_unlock_irqrestore(rq, &rf);
> +	}
> +
> +release_old:
> +	/*
> +	 * Load balance is done under rcu_lock.
> +	 * Wait for load balance before and during resizing to
> +	 * be done. They may refer to old nr_pref_llc[]
> +	 * that hasn't been resized.
> +	 */
> +	synchronize_rcu();
> +	for_each_present_cpu(i)
> +		kfree(*per_cpu_ptr(tmp_llc_pref, i));
> +
> +	free_percpu(tmp_llc_pref);
> +
> +	/* succeed and update */
> +	if (!ret)
> +		max_llcs = new_max_llcs;
> +
> +	return ret;
> +}

> @@ -2674,6 +2787,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  	if (has_cluster)
>  		static_branch_inc_cpuslocked(&sched_cluster_active);
>  
> +	resize_llc_pref();
> +
>  	if (rq && sched_debug_verbose)
>  		pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map));

I suspect people will hate on you for that synchronize_rcu() in there.

Specifically, we do build_sched_domain() for every CPU brought online,
this means booting 512 CPUs now includes 512 sync_rcu()s.

Worse, IIRC sync_rcu() is O(n) (or worse -- could be n*ln(n)) in number
of CPUs, so the total thing will be O(n^2) (or worse) for bringing CPUs
online.




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-03 23:07 ` [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter Tim Chen
  2025-12-09 13:06   ` Peter Zijlstra
  2025-12-10 12:43   ` Peter Zijlstra
@ 2025-12-10 12:51   ` Peter Zijlstra
  2025-12-10 18:49     ` Tim Chen
  2 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 12:51 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:

> +static int resize_llc_pref(void)
> +{
> +	unsigned int *__percpu *tmp_llc_pref;
> +	int i, ret = 0;
> +
> +	if (new_max_llcs <= max_llcs)
> +		return 0;
> +
> +	/*
> +	 * Allocate temp percpu pointer for old llc_pref,
> +	 * which will be released after switching to the
> +	 * new buffer.
> +	 */
> +	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
> +	if (!tmp_llc_pref)
> +		return -ENOMEM;
> +
> +	for_each_present_cpu(i)
> +		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
> +
> +	/*
> +	 * Resize the per rq nr_pref_llc buffer and
> +	 * switch to this new buffer.
> +	 */
> +	for_each_present_cpu(i) {
> +		struct rq_flags rf;
> +		unsigned int *new;
> +		struct rq *rq;
> +
> +		rq = cpu_rq(i);
> +		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
> +		if (!new) {
> +			ret = -ENOMEM;
> +
> +			goto release_old;
> +		}
> +
> +		/*
> +		 * Locking rq ensures that rq->nr_pref_llc values
> +		 * don't change with new task enqueue/dequeue
> +		 * when we repopulate the newly enlarged array.
> +		 */
> +		rq_lock_irqsave(rq, &rf);
> +		populate_new_pref_llcs(rq->nr_pref_llc, new);
> +		rq->nr_pref_llc = new;
> +		rq_unlock_irqrestore(rq, &rf);
> +	}
> +
> +release_old:
> +	/*
> +	 * Load balance is done under rcu_lock.
> +	 * Wait for load balance before and during resizing to
> +	 * be done. They may refer to old nr_pref_llc[]
> +	 * that hasn't been resized.
> +	 */
> +	synchronize_rcu();
> +	for_each_present_cpu(i)
> +		kfree(*per_cpu_ptr(tmp_llc_pref, i));
> +
> +	free_percpu(tmp_llc_pref);
> +
> +	/* succeed and update */
> +	if (!ret)
> +		max_llcs = new_max_llcs;
> +
> +	return ret;
> +}

Would it perhaps be easier to stick this thing in rq->sd rather than in
rq->nr_pref_llc. That way it automagically switches with the 'new'
domain. And then, with a bit of care, a singe load-balance pass should
see a consistent view (there should not be reloads of rq->sd -- which
will be a bit of an audit I suppose).

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 09/23] sched/cache: Count tasks prefering destination LLC in a sched group
  2025-12-03 23:07 ` [PATCH v2 09/23] sched/cache: Count tasks prefering destination LLC in a sched group Tim Chen
@ 2025-12-10 12:52   ` Peter Zijlstra
  2025-12-10 14:05     ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 12:52 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:28PM -0800, Tim Chen wrote:
> During LLC load balancing, tabulate the number of tasks on each runqueue
> that prefer the LLC contains the env->dst_cpu in a sched group.
> 
> For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
> balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
> 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
> selected as the busiest source to pick tasks from.
> 
> Within a source LLC, the total number of tasks preferring a destination
> LLC is computed by summing counts across all CPUs in that LLC. For
> instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
> LLC3, the total for LLC0 is 3.
> 
> These statistics allow the load balancer to choose tasks from source
> sched groups that best match their preferred LLCs.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> 
> Notes:
>     v1->v2:
>         Convert nr_pref_llc array in sg_lb_stats to a single
>         variable as only the dst LLC stat is needed.
>         (K Prateek Nayak)
> 
>  kernel/sched/fair.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b0e87616e377..4d7803f69a74 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10445,6 +10445,9 @@ struct sg_lb_stats {
>  	unsigned int nr_numa_running;
>  	unsigned int nr_preferred_running;
>  #endif
> +#ifdef CONFIG_SCHED_CACHE
> +	unsigned int nr_pref_llc;
> +#endif

At this point I have to note that rq->nr_pref_llc seems like a horrible
misnomer, for it being an array, and not an actual number like the
naming suggests.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 12/23] sched/cache: Add migrate_llc_task migration type for cache-aware balancing
  2025-12-03 23:07 ` [PATCH v2 12/23] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
@ 2025-12-10 13:32   ` Peter Zijlstra
  2025-12-16  0:52     ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 13:32 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:31PM -0800, Tim Chen wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index db555c11b5b8..529adf342ce0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9547,7 +9547,8 @@ enum migration_type {
>  	migrate_load = 0,
>  	migrate_util,
>  	migrate_task,
> -	migrate_misfit
> +	migrate_misfit,
> +	migrate_llc_task
>  };
>  
>  #define LBF_ALL_PINNED	0x01
> @@ -10134,6 +10135,10 @@ static int detach_tasks(struct lb_env *env)
>  			env->imbalance -= util;
>  			break;
>  
> +		case migrate_llc_task:
> +			env->imbalance--;
> +			break;
> +
>  		case migrate_task:
>  			env->imbalance--;
>  			break;

> @@ -12181,6 +12199,16 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>  			}
>  			break;
>  
> +		case migrate_llc_task:
> +#ifdef CONFIG_SCHED_CACHE
> +			dst_llc = llc_id(env->dst_cpu);
> +			if (dst_llc >= 0 &&
> +			    busiest_pref_llc < rq->nr_pref_llc[dst_llc]) {
> +				busiest_pref_llc = rq->nr_pref_llc[dst_llc];
> +				busiest = rq;
> +			}
> +#endif
> +			break;
>  		case migrate_task:
>  			if (busiest_nr < nr_running) {
>  				busiest_nr = nr_running;
> @@ -12363,6 +12391,8 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
>  	case migrate_misfit:
>  		__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
>  		break;
> +	case migrate_llc_task:
> +		break;
>  	}
>  }

The enum and all switch statements had the same order; you wrecked it!

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
  2025-12-10  9:37   ` Peter Zijlstra
@ 2025-12-10 13:57     ` Chen, Yu C
  2025-12-10 15:11       ` Peter Zijlstra
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-10 13:57 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/10/2025 6:37 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:20PM -0800, Tim Chen wrote:
> 
>> +static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
> 
>> +static void __no_profile task_cache_work(struct callback_head *work)
> 
> What's with the random __no_profile things?

In the early version without this tag we got some error reports from gcov.
We will check if this issue still exists and investigate.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-12-09 11:21   ` Peter Zijlstra
@ 2025-12-10 14:02     ` Chen, Yu C
  2025-12-10 15:13       ` Peter Zijlstra
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-10 14:02 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/9/2025 8:21 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:21PM -0800, Tim Chen wrote:
> 
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index bbcfdf12aa6e..0ba4697d74ba 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -68,6 +68,10 @@ struct sched_domain_shared {
>>   	atomic_t	nr_busy_cpus;
>>   	int		has_idle_cores;
>>   	int		nr_idle_scan;
>> +#ifdef CONFIG_SCHED_CACHE
>> +	unsigned long	util_avg;
>> +	unsigned long	capacity ____cacheline_aligned_in_smp;
> 
> This cacheline annotation confuses me, see below.
> 
>> +#endif
>>   };
>>   
>>   struct sched_domain {
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index cb82f558dc5b..b9f336300f14 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9622,6 +9622,29 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
>>   	return 0;
>>   }
>>   
>> +#ifdef CONFIG_SCHED_CACHE
>> +/* Called from load balancing paths with rcu_read_lock held */
>> +static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
>> +					 unsigned long *cap)
>> +{
>> +	struct sched_domain_shared *sd_share;
>> +
>> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> +	if (!sd_share)
>> +		return false;
>> +
>> +	*util = READ_ONCE(sd_share->util_avg);
>> +	*cap = READ_ONCE(sd_share->capacity);
> 
> You placed capacity on a separate line, forcing the above to be 2
> distinct lines. That seems... sub-optimal?
> 

The reason capacity was placed in a separate cache line
is that writes to capacity are not very frequent (CPU hotplug
should not happen too frequently), while writes to util_avg
are relatively frequent.
If capacity and util_avg were placed in the same cache line,
I'm thinking writes to util_avg might invalidate the entire
cache line. This could potentially cause cache misses when
capacity is read elsewhere, which might lead to false sharing?

thanks,
Chenyu

>> +
>> +	return true;
>> +}
>> +#else
>> +static inline bool get_llc_stats(int cpu, unsigned long *util,
>> +				 unsigned long *cap)
>> +{
>> +	return false;
>> +}
>> +#endif
>>   /*
>>    * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
>>    */
>> @@ -10592,6 +10615,51 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>>   	return check_cpu_capacity(rq, sd);
>>   }
>>   
>> +#ifdef CONFIG_SCHED_CACHE
>> +/*
>> + * Record the statistics for this scheduler group for later
>> + * use. These values guide load balancing on aggregating tasks
>> + * to a LLC.
>> + */
>> +static void record_sg_llc_stats(struct lb_env *env,
>> +				struct sg_lb_stats *sgs,
>> +				struct sched_group *group)
>> +{
>> +	struct sched_domain_shared *sd_share;
>> +
>> +	if (!sched_cache_enabled() || env->idle == CPU_NEWLY_IDLE)
>> +		return;
>> +
>> +	/* Only care about sched domain spanning multiple LLCs */
>> +	if (env->sd->child != rcu_dereference(per_cpu(sd_llc, env->dst_cpu)))
>> +		return;
>> +
>> +	/*
>> +	 * At this point we know this group spans a LLC domain.
>> +	 * Record the statistic of this group in its corresponding
>> +	 * shared LLC domain.
>> +	 * Note: sd_share cannot be obtained via sd->child->shared, because
>> +	 * it refers to the domain that covers the local group, while
>> +	 * sd_share could represent any of the LLC group.
>> +	 */
>> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
>> +					   cpumask_first(sched_group_span(group))));
>> +	if (!sd_share)
>> +		return;
>> +
>> +	if (READ_ONCE(sd_share->util_avg) != sgs->group_util)
>> +		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
>> +
>> +	if (unlikely(READ_ONCE(sd_share->capacity) != sgs->group_capacity))
>> +		WRITE_ONCE(sd_share->capacity, sgs->group_capacity);
> 
> And same here.
> 
>> +}
>> +#else
>> +static inline void record_sg_llc_stats(struct lb_env *env, struct sg_lb_stats *sgs,
>> +				       struct sched_group *group)
>> +{
>> +}
>> +#endif

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 09/23] sched/cache: Count tasks prefering destination LLC in a sched group
  2025-12-10 12:52   ` Peter Zijlstra
@ 2025-12-10 14:05     ` Chen, Yu C
  2025-12-10 15:16       ` Peter Zijlstra
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-10 14:05 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/10/2025 9:52 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:28PM -0800, Tim Chen wrote:
>> During LLC load balancing, tabulate the number of tasks on each runqueue
>> that prefer the LLC contains the env->dst_cpu in a sched group.
>>
>> For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
>> balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
>> 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
>> selected as the busiest source to pick tasks from.
>>
>> Within a source LLC, the total number of tasks preferring a destination
>> LLC is computed by summing counts across all CPUs in that LLC. For
>> instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
>> LLC3, the total for LLC0 is 3.
>>
>> These statistics allow the load balancer to choose tasks from source
>> sched groups that best match their preferred LLCs.
>>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>
>> Notes:
>>      v1->v2:
>>          Convert nr_pref_llc array in sg_lb_stats to a single
>>          variable as only the dst LLC stat is needed.
>>          (K Prateek Nayak)
>>
>>   kernel/sched/fair.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b0e87616e377..4d7803f69a74 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -10445,6 +10445,9 @@ struct sg_lb_stats {
>>   	unsigned int nr_numa_running;
>>   	unsigned int nr_preferred_running;
>>   #endif
>> +#ifdef CONFIG_SCHED_CACHE
>> +	unsigned int nr_pref_llc;
>> +#endif
> 
> At this point I have to note that rq->nr_pref_llc seems like a horrible
> misnomer, for it being an array, and not an actual number like the
> naming suggests.

In the v2 it seems that rq->nr_pref_llc is not an array anymore, it 
indicates the number of tasks that want to be migrated to the env->dst_cpu
(dst_llc), because these tasks' preferred LLC is env->dst_cpu (dst_llc).
Maybe rename it to rq->nr_pref_dst_llc?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
  2025-12-10 13:57     ` Chen, Yu C
@ 2025-12-10 15:11       ` Peter Zijlstra
  0 siblings, 0 replies; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 15:11 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 10, 2025 at 10:57:27PM +0900, Chen, Yu C wrote:
> On 12/10/2025 6:37 PM, Peter Zijlstra wrote:
> > On Wed, Dec 03, 2025 at 03:07:20PM -0800, Tim Chen wrote:
> > 
> > > +static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
> > 
> > > +static void __no_profile task_cache_work(struct callback_head *work)
> > 
> > What's with the random __no_profile things?
> 
> In the early version without this tag we got some error reports from gcov.
> We will check if this issue still exists and investigate.

That would be weird, nothing in the scheduler has anything like it. So
yeah, please see if you can find out where that came from.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-12-10 14:02     ` Chen, Yu C
@ 2025-12-10 15:13       ` Peter Zijlstra
  2025-12-10 23:58         ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 15:13 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 10, 2025 at 11:02:39PM +0900, Chen, Yu C wrote:
> On 12/9/2025 8:21 PM, Peter Zijlstra wrote:
> > On Wed, Dec 03, 2025 at 03:07:21PM -0800, Tim Chen wrote:
> > 
> > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > index bbcfdf12aa6e..0ba4697d74ba 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -68,6 +68,10 @@ struct sched_domain_shared {
> > >   	atomic_t	nr_busy_cpus;
> > >   	int		has_idle_cores;
> > >   	int		nr_idle_scan;
> > > +#ifdef CONFIG_SCHED_CACHE
> > > +	unsigned long	util_avg;
> > > +	unsigned long	capacity ____cacheline_aligned_in_smp;
> > 
> > This cacheline annotation confuses me, see below.
> > 
> > > +#endif
> > >   };
> > >   struct sched_domain {
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index cb82f558dc5b..b9f336300f14 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9622,6 +9622,29 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
> > >   	return 0;
> > >   }
> > > +#ifdef CONFIG_SCHED_CACHE
> > > +/* Called from load balancing paths with rcu_read_lock held */
> > > +static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
> > > +					 unsigned long *cap)
> > > +{
> > > +	struct sched_domain_shared *sd_share;
> > > +
> > > +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > > +	if (!sd_share)
> > > +		return false;
> > > +
> > > +	*util = READ_ONCE(sd_share->util_avg);
> > > +	*cap = READ_ONCE(sd_share->capacity);
> > 
> > You placed capacity on a separate line, forcing the above to be 2
> > distinct lines. That seems... sub-optimal?
> > 
> 
> The reason capacity was placed in a separate cache line
> is that writes to capacity are not very frequent (CPU hotplug
> should not happen too frequently), while writes to util_avg
> are relatively frequent.
> If capacity and util_avg were placed in the same cache line,
> I'm thinking writes to util_avg might invalidate the entire
> cache line. This could potentially cause cache misses when
> capacity is read elsewhere, which might lead to false sharing?

But it's introduced here and already read/written together. Is this not
premature optimization?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 09/23] sched/cache: Count tasks prefering destination LLC in a sched group
  2025-12-10 14:05     ` Chen, Yu C
@ 2025-12-10 15:16       ` Peter Zijlstra
  2025-12-10 19:00         ` Tim Chen
  2025-12-10 23:50         ` Chen, Yu C
  0 siblings, 2 replies; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 15:16 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 10, 2025 at 11:05:33PM +0900, Chen, Yu C wrote:
> On 12/10/2025 9:52 PM, Peter Zijlstra wrote:
> > On Wed, Dec 03, 2025 at 03:07:28PM -0800, Tim Chen wrote:
> > > During LLC load balancing, tabulate the number of tasks on each runqueue
> > > that prefer the LLC contains the env->dst_cpu in a sched group.
> > > 
> > > For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
> > > balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
> > > 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
> > > selected as the busiest source to pick tasks from.
> > > 
> > > Within a source LLC, the total number of tasks preferring a destination
> > > LLC is computed by summing counts across all CPUs in that LLC. For
> > > instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
> > > LLC3, the total for LLC0 is 3.
> > > 
> > > These statistics allow the load balancer to choose tasks from source
> > > sched groups that best match their preferred LLCs.
> > > 
> > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > ---
> > > 
> > > Notes:
> > >      v1->v2:
> > >          Convert nr_pref_llc array in sg_lb_stats to a single
> > >          variable as only the dst LLC stat is needed.
> > >          (K Prateek Nayak)
> > > 
> > >   kernel/sched/fair.c | 12 ++++++++++++
> > >   1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index b0e87616e377..4d7803f69a74 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10445,6 +10445,9 @@ struct sg_lb_stats {
> > >   	unsigned int nr_numa_running;
> > >   	unsigned int nr_preferred_running;
> > >   #endif
> > > +#ifdef CONFIG_SCHED_CACHE
> > > +	unsigned int nr_pref_llc;
> > > +#endif
> > 
> > At this point I have to note that rq->nr_pref_llc seems like a horrible
> > misnomer, for it being an array, and not an actual number like the
> > naming suggests.
> 
> In the v2 it seems that rq->nr_pref_llc is not an array anymore, it

From two patches ago:

+       unsigned int            *nr_pref_llc;

It's a pointer of some sort.


> indicates the number of tasks that want to be migrated to the env->dst_cpu
> (dst_llc), because these tasks' preferred LLC is env->dst_cpu (dst_llc).
> Maybe rename it to rq->nr_pref_dst_llc?

Like I said in:

  https://lkml.kernel.org/r/20251210125114.GS3707891@noisy.programming.kicks-ass.net

it might make sense to put it in struct sched_domain instead of struct
rq, since then you can allocate and swap it right along with the rest of
the domain tree.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 14/23] sched/cache: Consider LLC preference when selecting tasks for load balancing
  2025-12-03 23:07 ` [PATCH v2 14/23] sched/cache: Consider LLC preference when selecting tasks for load balancing Tim Chen
@ 2025-12-10 15:58   ` Peter Zijlstra
  0 siblings, 0 replies; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 15:58 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:33PM -0800, Tim Chen wrote:
> Currently, task selection from the busiest runqueue ignores LLC
> preferences. Reorder tasks in the busiest queue to prioritize selection
> as follows:
> 
>   1. Tasks preferring the destination CPU's LLC
>   2. Tasks with no LLC preference
>   3. Tasks preferring an LLC different from their current one
>   4. Tasks preferring the LLC they are currently on
> 
> This improves the likelihood that tasks are migrated to their
> preferred LLC.
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> 
> Notes:
>     v1->v2: No change.
> 
>  kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 65 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aed3fab98d7c..dd09a816670e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10092,6 +10092,68 @@ static struct task_struct *detach_one_task(struct lb_env *env)
>  	return NULL;
>  }
>  
> +#ifdef CONFIG_SCHED_CACHE
> +/*
> + * Prepare lists to detach tasks in the following order:
> + * 1. tasks that prefer dst cpu's LLC
> + * 2. tasks that have no preference in LLC
> + * 3. tasks that prefer LLC other than the ones they are on
> + * 4. tasks that prefer the LLC that they are currently on.
> + */
> +static struct list_head
> +*order_tasks_by_llc(struct lb_env *env, struct list_head *tasks)
> +{
> +	struct task_struct *p;
> +	LIST_HEAD(pref_old_llc);
> +	LIST_HEAD(pref_new_llc);
> +	LIST_HEAD(no_pref_llc);
> +	LIST_HEAD(pref_other_llc);
> +
> +	if (!sched_cache_enabled())
> +		return tasks;
> +
> +	if (cpus_share_cache(env->dst_cpu, env->src_cpu))
> +		return tasks;
> +
> +	while (!list_empty(tasks)) {
> +		p = list_last_entry(tasks, struct task_struct, se.group_node);
> +
> +		if (p->preferred_llc == llc_id(env->dst_cpu)) {
> +			list_move(&p->se.group_node, &pref_new_llc);
> +			continue;
> +		}
> +
> +		if (p->preferred_llc == llc_id(env->src_cpu)) {
> +			list_move(&p->se.group_node, &pref_old_llc);
> +			continue;
> +		}
> +
> +		if (p->preferred_llc == -1) {
> +			list_move(&p->se.group_node, &no_pref_llc);
> +			continue;
> +		}
> +
> +		list_move(&p->se.group_node, &pref_other_llc);
> +	}
> +
> +	/*
> +	 * We detach tasks from list tail in detach tasks.  Put tasks
> +	 * to be chosen first at end of list.
> +	 */
> +	list_splice(&pref_new_llc, tasks);
> +	list_splice(&no_pref_llc, tasks);
> +	list_splice(&pref_other_llc, tasks);
> +	list_splice(&pref_old_llc, tasks);
> +	return tasks;
> +}

> @@ -10119,6 +10181,8 @@ static int detach_tasks(struct lb_env *env)
>  	if (env->imbalance <= 0)
>  		return 0;
>  
> +	tasks = order_tasks_by_llc(env, &env->src_rq->cfs_tasks);
> +
>  	while (!list_empty(tasks)) {
>  		/*
>  		 * We don't want to steal all, otherwise we may be treated likewise,

Humrph. So NUMA balancing does this differently. It skips over the tasks
that would degrade locality in can_migrate_task(); and only if
nr_balanced_failed is high enough do we ignore that.

Also, if there are a significant number of tasks on the list, this gets
in the way of things like loop_break, since it does this sort
unconditionally.

Bah, this feels like there is a sane way to integrate all this, but it
seems to escape me at the moment. I'll ponder it a bit more.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach
  2025-12-03 23:07 ` [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach Tim Chen
@ 2025-12-10 16:30   ` Peter Zijlstra
  2025-12-16  7:30     ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 16:30 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:34PM -0800, Tim Chen wrote:

> @@ -10025,6 +10025,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (env->flags & LBF_ACTIVE_LB)
>  		return 1;
>  
> +#ifdef CONFIG_SCHED_CACHE
> +	if (sched_cache_enabled() &&
> +	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid &&
> +	    !task_has_sched_core(p))
> +		return 0;
> +#endif

This seems wrong:
 - it does not let nr_balance_failed override things;
 - it takes precedence over migrate_degrades_locality(); you really want
   to migrate towards the preferred NUMA node over staying on your LLC.

That is, this really wants to be done after migrate_degrades_locality()
and only if degrades == 0 or something.
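
Roughly something like this (sketch, leaving the nr_balance_failed
question above aside):

	degrades = migrate_degrades_locality(p, env);
	if (!degrades) {
#ifdef CONFIG_SCHED_CACHE
		/* only veto on LLC grounds when NUMA balancing has no preference */
		if (sched_cache_enabled() &&
		    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid &&
		    !task_has_sched_core(p))
			return 0;
#endif
		hot = task_hot(p, env);
	}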

>  	degrades = migrate_degrades_locality(p, env);
>  	if (!degrades)
>  		hot = task_hot(p, env);
> @@ -10146,12 +10153,55 @@ static struct list_head
>  	list_splice(&pref_old_llc, tasks);
>  	return tasks;
>  }
> +
> +static bool stop_migrate_src_rq(struct task_struct *p,
> +				struct lb_env *env,
> +				int detached)
> +{
> +	if (!sched_cache_enabled() || p->preferred_llc == -1 ||
> +	    cpus_share_cache(env->src_cpu, env->dst_cpu) ||
> +	    env->sd->nr_balance_failed)
> +		return false;

But you are allowing nr_balance_failed to override things here.

> +	/*
> +	 * Stop migration for the src_rq and pull from a
> +	 * different busy runqueue in the following cases:
> +	 *
> +	 * 1. Trying to migrate task to its preferred
> +	 *    LLC, but the chosen task does not prefer dest
> +	 *    LLC - case 3 in order_tasks_by_llc(). This violates
> +	 *    the goal of migrate_llc_task. However, we should
> +	 *    stop detaching only if some tasks have been detached
> +	 *    and the imbalance has been mitigated.
> +	 *
> +	 * 2. Don't detach more tasks if the remaining tasks want
> +	 *    to stay. We know the remaining tasks all prefer the
> +	 *    current LLC, because after order_tasks_by_llc(), the
> +	 *    tasks that prefer the current LLC are the least favored
> +	 *    candidates to be migrated out.
> +	 */
> +	if (env->migration_type == migrate_llc_task &&
> +	    detached && llc_id(env->dst_cpu) != p->preferred_llc)
> +		return true;
> +
> +	if (llc_id(env->src_cpu) == p->preferred_llc)
> +		return true;
> +
> +	return false;
> +}

Also, I think we have a problem with nr_balance_failed, cache_nice_tries
is 1 for SHARE_LLC; this means for failed=0 we ignore:

 - ineligible tasks
 - llc fail
 - node-degrading / hot

and then the very next round, we do all of them at once, without much
grading.

> @@ -10205,6 +10255,15 @@ static int detach_tasks(struct lb_env *env)
>  
>  		p = list_last_entry(tasks, struct task_struct, se.group_node);
>  
> +		/*
> +		 * Check if detaching current src_rq should be stopped, because
> +		 * doing so would break cache aware load balance. If we stop
> +		 * here, the env->flags has LBF_ALL_PINNED, which would cause
> +		 * the load balance to pull from another busy runqueue.

Uhh, can_migrate_task() will clear that ALL_PINNED thing if we've found
at least one task before getting here.

> +		 */
> +		if (stop_migrate_src_rq(p, env, detached))
> +			break;


Perhaps split cfs_tasks into multiple lists from the get-go? That avoids
this sorting.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node
  2025-12-03 23:07 ` [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node Tim Chen
@ 2025-12-10 16:32   ` Peter Zijlstra
  2025-12-10 16:52     ` Peter Zijlstra
  2025-12-16  7:31     ` Chen, Yu C
  0 siblings, 2 replies; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 16:32 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel, Libo Chen

On Wed, Dec 03, 2025 at 03:07:35PM -0800, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
> 
> Cache-aware load balancing should only be enabled if there are more
> than 1 LLCs within 1 NUMA node. sched_cache_present is introduced to
> indicate whether this platform supports this topology.
> 
> Suggested-by: Libo Chen <libo.chen@oracle.com>
> Suggested-by: Adam Li <adamli@os.amperecomputing.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> 
> Notes:
>     v1->v2:
>     	Use flag sched_cache_present to indicate whether a platform
>     	supports cache aware scheduling. Change this flag from staic key.
>     	There should be only 1 static key to control the cache aware
>     	scheduling. (Peter Zijlstra)
> 
>  kernel/sched/topology.c | 20 +++++++++++++++-----
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d583399fc6a1..9799e3a9a609 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -24,6 +24,8 @@ int max_llcs;
>  
>  #ifdef CONFIG_SCHED_CACHE
>  
> +static bool sched_cache_present;

sched_energy_present
sched_asym_cpucapacity
sched_cluster_active
sched_smt_present

are all static keys tied to the current topology, why break the streak
and make this a boolean?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
  2025-12-03 23:07 ` [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling Tim Chen
@ 2025-12-10 16:51   ` Peter Zijlstra
  2025-12-16  7:40     ` Chen, Yu C
  2025-12-17  9:40   ` Aaron Lu
  1 sibling, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 16:51 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
> 
> A performance regression was observed by Prateek when running hackbench
> with many threads per process (high fd count). To avoid this, processes
> with a large number of active threads are excluded from cache-aware
> scheduling.
> 
> With sched_cache enabled, record the number of active threads in each
> process during the periodic task_cache_work(). While iterating over
> CPUs, if the currently running task belongs to the same process as the
> task that launched task_cache_work(), increment the active thread count.
> 
> This number will be used by subsequent patch to inhibit cache aware
> load balance.
> 
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> 
> Notes:
>     v1->v2: No change.
> 
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 11 +++++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1ea16ef90566..04743983de4d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1043,6 +1043,7 @@ struct mm_struct {
>  		raw_spinlock_t mm_sched_lock;
>  		unsigned long mm_sched_epoch;
>  		int mm_sched_cpu;
> +		u64 nr_running_avg ____cacheline_aligned_in_smp;

This is unlikely to do what you hope: it will place this variable on a
new cacheline, but will not ensure this variable is the only one in
that line. Notably pgtables_bytes (the next field in this structure)
will share the line.

It might all be less dodgy if you stick these here fields in their own
structure, a little like mm_mm_cid or so.
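
Something in this direction, perhaps (a rough sketch only; the field names
are taken from the quoted diff, the wrapper struct name is invented here
and the real thing would be embedded in struct mm_struct):

/*
 * Group the cache-aware scheduling state so the whole group, rather
 * than a single member, gets its own cacheline.
 */
struct mm_sched_state {
	struct mm_sched __percpu	*pcpu_sched;
	raw_spinlock_t			mm_sched_lock;
	unsigned long			mm_sched_epoch;
	int				mm_sched_cpu;
	u64				nr_running_avg;
} ____cacheline_aligned_in_smp;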

>  #endif
>  
>  #ifdef CONFIG_MMU
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 580a967efdac..2f38ad82688f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
>  
>  static void __no_profile task_cache_work(struct callback_head *work)
>  {
> -	struct task_struct *p = current;
> +	struct task_struct *p = current, *cur;
>  	struct mm_struct *mm = p->mm;
>  	unsigned long m_a_occ = 0;
>  	unsigned long curr_m_a_occ = 0;
> -	int cpu, m_a_cpu = -1;
> +	int cpu, m_a_cpu = -1, nr_running = 0;
>  	cpumask_var_t cpus;
>  
>  	WARN_ON_ONCE(work != &p->cache_work);
> @@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  					m_occ = occ;
>  					m_cpu = i;
>  				}

	guard(rcu)();

> +				rcu_read_lock();
> +				cur = rcu_dereference(cpu_rq(i)->curr);
> +				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
> +				    cur->mm == mm)
> +					nr_running++;
> +				rcu_read_unlock();
>  			}
>  
>  			/*
> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  		mm->mm_sched_cpu = m_a_cpu;
>  	}
>  
> +	update_avg(&mm->nr_running_avg, nr_running);
>  	free_cpumask_var(cpus);
>  }

It's a wee bit weird to introduce nr_running_avg without its user. Makes
it hard to see what's what.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node
  2025-12-10 16:32   ` Peter Zijlstra
@ 2025-12-10 16:52     ` Peter Zijlstra
  2025-12-16  7:36       ` Chen, Yu C
  2025-12-16  7:31     ` Chen, Yu C
  1 sibling, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 16:52 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel, Libo Chen

On Wed, Dec 10, 2025 at 05:32:35PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:35PM -0800, Tim Chen wrote:
> > From: Chen Yu <yu.c.chen@intel.com>
> > 
> > Cache-aware load balancing should only be enabled if there are more
> > than one LLC within a NUMA node. sched_cache_present is introduced to
> > indicate whether this platform supports this topology.
> > 
> > Suggested-by: Libo Chen <libo.chen@oracle.com>
> > Suggested-by: Adam Li <adamli@os.amperecomputing.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> > 
> > Notes:
> >     v1->v2:
> >     	Use flag sched_cache_present to indicate whether a platform
> >     	supports cache-aware scheduling. Changed this flag from a static
> >     	key to a boolean; there should be only 1 static key to control
> >     	the cache-aware scheduling. (Peter Zijlstra)
> > 
> >  kernel/sched/topology.c | 20 +++++++++++++++-----
> >  1 file changed, 15 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d583399fc6a1..9799e3a9a609 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -24,6 +24,8 @@ int max_llcs;
> >  
> >  #ifdef CONFIG_SCHED_CACHE
> >  
> > +static bool sched_cache_present;
> 
> sched_energy_present
> sched_asym_cpucapacity
> sched_cluster_active
> sched_smt_present
> 
> are all static keys tied to the current topology, why break the streak
> and make this a boolean?

Also, the patch doesn't use sched_cache_present at all, so perhaps just drop
it on the floor entirely?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-03 23:07 ` [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling Tim Chen
@ 2025-12-10 17:02   ` Peter Zijlstra
  2025-12-16  7:42     ` Chen, Yu C
  2025-12-19  4:14   ` Vern Hao
  2025-12-23 12:12   ` Yangyu Chen
  2 siblings, 1 reply; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-10 17:02 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:39PM -0800, Tim Chen wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 466ba8b7398c..95bf080bbbf0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>  DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>  
>  #ifdef CONFIG_SCHED_CACHE
> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
> +
>  static inline bool sched_cache_enabled(void)
>  {
> -	return false;
> +	return static_branch_unlikely(&sched_cache_on);
>  }
>  #endif
>  
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 02e16b70a790..cde324672103 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>  	.release	= single_release,
>  };
>  
> +#ifdef CONFIG_SCHED_CACHE
> +#define SCHED_CACHE_CREATE_CONTROL(name, max)			  \
> +static ssize_t sched_cache_write_##name(struct file *filp,	  \
> +					const char __user *ubuf,  \
> +					size_t cnt, loff_t *ppos) \
> +{								  \
> +	char buf[16];						  \
> +	unsigned int val;					  \
> +	if (cnt > 15)						  \
> +		cnt = 15;					  \
> +	if (copy_from_user(&buf, ubuf, cnt))			  \
> +		return -EFAULT;					  \
> +	buf[cnt] = '\0';					  \


> +	if (kstrtouint(buf, 10, &val))				  \
> +		return -EINVAL;					  \
> +	if (val > (max))						  \
> +		return -EINVAL;					  \
> +	llc_##name = val;					  \
> +	if (!strcmp(#name, "enabled"))				  \
> +		sched_cache_set(false);				  \

Oh gawd :-(

Please just write out all the various write methods and use
kstrtoul_from_user() and kstrtobool_from_user() where applicable.
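
Something like the following per knob, for instance (an untested sketch;
llc_overload_pct is the variable from the quoted patch and the 100 limit
mirrors the macro argument above):

static ssize_t sched_cache_write_overload_pct(struct file *filp,
					      const char __user *ubuf,
					      size_t cnt, loff_t *ppos)
{
	unsigned long val;
	int ret;

	/* Parse the decimal value straight from user space. */
	ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
	if (ret)
		return ret;
	if (val > 100)
		return -EINVAL;

	llc_overload_pct = val;
	*ppos += cnt;
	return cnt;
}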

> +	*ppos += cnt;						  \
> +	return cnt;						  \
> +}								  \
> +static int sched_cache_show_##name(struct seq_file *m, void *v)	  \
> +{								  \
> +	seq_printf(m, "%d\n", llc_##name);			  \
> +	return 0;						  \
> +}								  \
> +static int sched_cache_open_##name(struct inode *inode,		  \
> +				   struct file *filp)		  \
> +{								  \
> +	return single_open(filp, sched_cache_show_##name, NULL);  \
> +}								  \
> +static const struct file_operations sched_cache_fops_##name = {	  \
> +	.open		= sched_cache_open_##name,		  \
> +	.write		= sched_cache_write_##name,		  \
> +	.read		= seq_read,				  \
> +	.llseek		= seq_lseek,				  \
> +	.release	= single_release,			  \
> +}
> +
> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-10 12:43   ` Peter Zijlstra
@ 2025-12-10 18:36     ` Tim Chen
  0 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-10 18:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, 2025-12-10 at 13:43 +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:
> 
> > +static int resize_llc_pref(void)
> > +{
> > +	unsigned int *__percpu *tmp_llc_pref;
> > +	int i, ret = 0;
> > +
> > +	if (new_max_llcs <= max_llcs)
> > +		return 0;
> > +
> > +	/*
> > +	 * Allocate temp percpu pointer for old llc_pref,
> > +	 * which will be released after switching to the
> > +	 * new buffer.
> > +	 */
> > +	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
> > +	if (!tmp_llc_pref)
> > +		return -ENOMEM;
> > +
> > +	for_each_present_cpu(i)
> > +		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
> > +
> > +	/*
> > +	 * Resize the per rq nr_pref_llc buffer and
> > +	 * switch to this new buffer.
> > +	 */
> > +	for_each_present_cpu(i) {
> > +		struct rq_flags rf;
> > +		unsigned int *new;
> > +		struct rq *rq;
> > +
> > +		rq = cpu_rq(i);
> > +		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
> > +		if (!new) {
> > +			ret = -ENOMEM;
> > +
> > +			goto release_old;
> > +		}
> > +
> > +		/*
> > +		 * Locking rq ensures that rq->nr_pref_llc values
> > +		 * don't change with new task enqueue/dequeue
> > +		 * when we repopulate the newly enlarged array.
> > +		 */
> > +		rq_lock_irqsave(rq, &rf);
> > +		populate_new_pref_llcs(rq->nr_pref_llc, new);
> > +		rq->nr_pref_llc = new;
> > +		rq_unlock_irqrestore(rq, &rf);
> > +	}
> > +
> > +release_old:
> > +	/*
> > +	 * Load balance is done under rcu_lock.
> > +	 * Wait for load balance before and during resizing to
> > +	 * be done. They may refer to old nr_pref_llc[]
> > +	 * that hasn't been resized.
> > +	 */
> > +	synchronize_rcu();
> > +	for_each_present_cpu(i)
> > +		kfree(*per_cpu_ptr(tmp_llc_pref, i));
> > +
> > +	free_percpu(tmp_llc_pref);
> > +
> > +	/* succeed and update */
> > +	if (!ret)
> > +		max_llcs = new_max_llcs;
> > +
> > +	return ret;
> > +}
> 
> > @@ -2674,6 +2787,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  	if (has_cluster)
> >  		static_branch_inc_cpuslocked(&sched_cluster_active);
> >  
> > +	resize_llc_pref();
> > +
> >  	if (rq && sched_debug_verbose)
> >  		pr_info("root domain span: %*pbl\n", cpumask_pr_args(cpu_map));
> 
> I suspect people will hate on you for that synchronize_rcu() in there.
> 
> Specifically, we do build_sched_domain() for every CPU brought online,
> this means booting 512 CPUs now includes 512 sync_rcu()s.
> Worse, IIRC sync_rcu() is O(n) (or worse -- could be n*ln(n)) in number
> of CPUs, so the total thing will be O(n^2) (or worse) for bringing CPUs
> online.
> 
> 

Though we only do synchronize_rcu() in resize_llc_pref() when we encounter
a new LLC and need a larger array of LLCs, not on every CPU. That said,
I agree that the free is better done in an RCU callback to avoid the
synchronize_rcu() overhead.
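
A rough sketch of that direction (illustrative only; the wrapper struct
and helper names are made up here and are not from the series):

/*
 * Wrap the retired per-rq buffer so it can be freed after a grace
 * period without blocking in synchronize_rcu().
 */
struct old_llc_pref {
	struct rcu_head	rcu;
	unsigned int	buf[];
};

static void retire_llc_pref(struct old_llc_pref *old)
{
	if (old)
		kfree_rcu(old, rcu);
}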

Tim


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-10 12:51   ` Peter Zijlstra
@ 2025-12-10 18:49     ` Tim Chen
  2025-12-11 10:31       ` Peter Zijlstra
  0 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-10 18:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, 2025-12-10 at 13:51 +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:
> 
> > +static int resize_llc_pref(void)
> > +{
> > +	unsigned int *__percpu *tmp_llc_pref;
> > +	int i, ret = 0;
> > +
> > +	if (new_max_llcs <= max_llcs)
> > +		return 0;
> > +
> > +	/*
> > +	 * Allocate temp percpu pointer for old llc_pref,
> > +	 * which will be released after switching to the
> > +	 * new buffer.
> > +	 */
> > +	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
> > +	if (!tmp_llc_pref)
> > +		return -ENOMEM;
> > +
> > +	for_each_present_cpu(i)
> > +		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
> > +
> > +	/*
> > +	 * Resize the per rq nr_pref_llc buffer and
> > +	 * switch to this new buffer.
> > +	 */
> > +	for_each_present_cpu(i) {
> > +		struct rq_flags rf;
> > +		unsigned int *new;
> > +		struct rq *rq;
> > +
> > +		rq = cpu_rq(i);
> > +		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
> > +		if (!new) {
> > +			ret = -ENOMEM;
> > +
> > +			goto release_old;
> > +		}
> > +
> > +		/*
> > +		 * Locking rq ensures that rq->nr_pref_llc values
> > +		 * don't change with new task enqueue/dequeue
> > +		 * when we repopulate the newly enlarged array.
> > +		 */
> > +		rq_lock_irqsave(rq, &rf);
> > +		populate_new_pref_llcs(rq->nr_pref_llc, new);
> > +		rq->nr_pref_llc = new;
> > +		rq_unlock_irqrestore(rq, &rf);
> > +	}
> > +
> > +release_old:
> > +	/*
> > +	 * Load balance is done under rcu_lock.
> > +	 * Wait for load balance before and during resizing to
> > +	 * be done. They may refer to old nr_pref_llc[]
> > +	 * that hasn't been resized.
> > +	 */
> > +	synchronize_rcu();
> > +	for_each_present_cpu(i)
> > +		kfree(*per_cpu_ptr(tmp_llc_pref, i));
> > +
> > +	free_percpu(tmp_llc_pref);
> > +
> > +	/* succeed and update */
> > +	if (!ret)
> > +		max_llcs = new_max_llcs;
> > +
> > +	return ret;
> > +}
> 
> Would it perhaps be easier to stick this thing in rq->sd rather than in
> rq->nr_pref_llc. That way it automagically switches with the 'new'
> domain. And then, with a bit of care, a singe load-balance pass should
> see a consistent view (there should not be reloads of rq->sd -- which
> will be a bit of an audit I suppose).

We need nr_pref_llc information at the runqueue level because the load balancer 
must identify which specific rq has the largest number of tasks that 
prefer a given destination LLC. If we move the counter to the LLC’s sd 
level, we would only know the aggregate number of tasks in the entire LLC 
that prefer that destination—not which rq they reside on. Without per-rq 
counts, we would not be able to select the correct source rq to pull tasks from.

The only way this could work at the LLC-sd level is if all CPUs within 
the LLC shared a single runqueue, which is not the case today.

Let me know if I understand your comments correctly.

Tim



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 09/23] sched/cache: Count tasks preferring destination LLC in a sched group
  2025-12-10 15:16       ` Peter Zijlstra
@ 2025-12-10 19:00         ` Tim Chen
  2025-12-10 23:50         ` Chen, Yu C
  1 sibling, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-10 19:00 UTC (permalink / raw)
  To: Peter Zijlstra, Chen, Yu C
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, 2025-12-10 at 16:16 +0100, Peter Zijlstra wrote:
> On Wed, Dec 10, 2025 at 11:05:33PM +0900, Chen, Yu C wrote:
> > On 12/10/2025 9:52 PM, Peter Zijlstra wrote:
> > > On Wed, Dec 03, 2025 at 03:07:28PM -0800, Tim Chen wrote:
> > > > During LLC load balancing, tabulate the number of tasks on each runqueue
> > > > that prefer the LLC containing env->dst_cpu in a sched group.
> > > > 
> > > > For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
> > > > balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
> > > > 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
> > > > selected as the busiest source to pick tasks from.
> > > > 
> > > > Within a source LLC, the total number of tasks preferring a destination
> > > > LLC is computed by summing counts across all CPUs in that LLC. For
> > > > instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
> > > > LLC3, the total for LLC0 is 3.
> > > > 
> > > > These statistics allow the load balancer to choose tasks from source
> > > > sched groups that best match their preferred LLCs.
> > > > 
> > > > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > ---
> > > > 
> > > > Notes:
> > > >      v1->v2:
> > > >          Convert nr_pref_llc array in sg_lb_stats to a single
> > > >          variable as only the dst LLC stat is needed.
> > > >          (K Prateek Nayak)
> > > > 
> > > >   kernel/sched/fair.c | 12 ++++++++++++
> > > >   1 file changed, 12 insertions(+)
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index b0e87616e377..4d7803f69a74 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -10445,6 +10445,9 @@ struct sg_lb_stats {
> > > >   	unsigned int nr_numa_running;
> > > >   	unsigned int nr_preferred_running;
> > > >   #endif
> > > > +#ifdef CONFIG_SCHED_CACHE
> > > > +	unsigned int nr_pref_llc;
> > > > +#endif
> > > 
> > > At this point I have to note that rq->nr_pref_llc seems like a horrible
> > > misnomer, for it being an array, and not an actual number like the
> > > naming suggests.
> > 
> > In the v2 it seems that rq->nr_pref_llc is not an array anymore, it
> 
> From two patches ago:
> 
> +       unsigned int            *nr_pref_llc;
> 
> It's a pointer of some sort.


Perhaps I should use a different name here when I update this patch
for v2.

rq->nr_pref_llc[] is an array, as it records the number of tasks
preferring each LLC. However, sgs->nr_pref_llc is a single number
representing the number of tasks in the current sched group that prefer
the destination LLC.

Sorry for using the same name; that may have created this confusion.
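
To spell out the two uses being conflated (a hedged illustration with the
surrounding kernel structures elided; only the two fields matter here):

/* Per runqueue: one counter per LLC id. */
struct rq_example {
	unsigned int *nr_pref_llc;	/* nr_pref_llc[llc] = tasks on this rq preferring that LLC */
};

/* Per sched-group stats: a single count for the destination LLC. */
struct sg_lb_stats_example {
	unsigned int nr_pref_llc;	/* tasks in this group preferring env->dst_cpu's LLC */
};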

> 
> 
> > indicates
> > the number of tasks that want to be migrated to the env->dst_cpu (dst_llc),
> > because
> > these tasks' preferred LLC are env->dst_cpu(dst_llc). Maybe renaming it to
> > rq->nr_pref_dst_llc?
> 
> Like I said in:
> 
>   https://lkml.kernel.org/r/20251210125114.GS3707891@noisy.programming.kicks-ass.net
> 
> it might make sense to put it in struct sched_domain instead of struct
> rq, since then you can allocate and swap it right along with the rest of
> the domain tree.

Sent a separate reply to that comment to clarify why I think we need nr_pref_llc[] per
run queue.

Tim

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 09/23] sched/cache: Count tasks preferring destination LLC in a sched group
  2025-12-10 15:16       ` Peter Zijlstra
  2025-12-10 19:00         ` Tim Chen
@ 2025-12-10 23:50         ` Chen, Yu C
  1 sibling, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-10 23:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/11/2025 12:16 AM, Peter Zijlstra wrote:
> On Wed, Dec 10, 2025 at 11:05:33PM +0900, Chen, Yu C wrote:
>> On 12/10/2025 9:52 PM, Peter Zijlstra wrote:
>>> On Wed, Dec 03, 2025 at 03:07:28PM -0800, Tim Chen wrote:
>>>> During LLC load balancing, tabulate the number of tasks on each runqueue
>>>> that prefer the LLC containing env->dst_cpu in a sched group.
>>>>
>>>> For example, consider a system with 4 LLC sched groups (LLC0 to LLC3)
>>>> balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has
>>>> 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is
>>>> selected as the busiest source to pick tasks from.
>>>>
>>>> Within a source LLC, the total number of tasks preferring a destination
>>>> LLC is computed by summing counts across all CPUs in that LLC. For
>>>> instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring
>>>> LLC3, the total for LLC0 is 3.
>>>>
>>>> These statistics allow the load balancer to choose tasks from source
>>>> sched groups that best match their preferred LLCs.
>>>>
>>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>>> ---
>>>>
>>>> Notes:
>>>>       v1->v2:
>>>>           Convert nr_pref_llc array in sg_lb_stats to a single
>>>>           variable as only the dst LLC stat is needed.
>>>>           (K Prateek Nayak)
>>>>
>>>>    kernel/sched/fair.c | 12 ++++++++++++
>>>>    1 file changed, 12 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index b0e87616e377..4d7803f69a74 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -10445,6 +10445,9 @@ struct sg_lb_stats {
>>>>    	unsigned int nr_numa_running;
>>>>    	unsigned int nr_preferred_running;
>>>>    #endif
>>>> +#ifdef CONFIG_SCHED_CACHE
>>>> +	unsigned int nr_pref_llc;
>>>> +#endif
>>>
>>> At this point I have to note that rq->nr_pref_llc seems like a horrible
>>> misnomer, for it being an array, and not an actual number like the
>>> naming suggests.
>>
>> In the v2 it seems that rq->nr_pref_llc is not an array anymore, it
> 
>  From two patches ago:
> 
> +       unsigned int            *nr_pref_llc;
> 
> It's a pointer of some sort.
> 
> 

Ah I see, I thought it was the variable in the sgs structure.


>> indicates
>> the number of tasks that want to be migrated to the env->dst_cpu (dst_llc),
>> because
>> these tasks' preferred LLC are env->dst_cpu(dst_llc). Maybe renaming it to
>> rq->nr_pref_dst_llc?
> 
> Like I said in:
> 
>    https://lkml.kernel.org/r/20251210125114.GS3707891@noisy.programming.kicks-ass.net
> 
> it might make sense to put it in struct sched_domain instead of struct
> rq, since then you can allocate and swap it right along with the rest of
> the domain tree.

I'll think more about this. Currently the per-CPU rq's nr_pref_llc is
used to identify the "busiest" runqueue. The busiest runqueue has the
most threads that want to be migrated to llc_id(env->dst_cpu), because
those threads' preferred LLC is there - in this way, the migration
success ratio to the preferred LLC would be higher without breaking the
imbalance too much, IMHO. So we might have to track the per-CPU rq's
statistics during enqueue/dequeue. If we put it in the domain, I'm not
sure how to track that.
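
For illustration, the per-rq bookkeeping meant above might look roughly
like this (a sketch using the names from the quoted patches, not the
actual series code):

/* Called with delta = +1 on enqueue and delta = -1 on dequeue. */
static void account_llc_pref(struct rq *rq, struct task_struct *p, int delta)
{
	int llc = p->preferred_llc;

	if (llc >= 0)
		rq->nr_pref_llc[llc] += delta;
}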

Thanks,
Chenyu



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions
  2025-12-10 15:13       ` Peter Zijlstra
@ 2025-12-10 23:58         ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-10 23:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tim Chen, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/11/2025 12:13 AM, Peter Zijlstra wrote:
> On Wed, Dec 10, 2025 at 11:02:39PM +0900, Chen, Yu C wrote:
>> On 12/9/2025 8:21 PM, Peter Zijlstra wrote:
>>> On Wed, Dec 03, 2025 at 03:07:21PM -0800, Tim Chen wrote:
>>>
>>>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>>> index bbcfdf12aa6e..0ba4697d74ba 100644
>>>> --- a/include/linux/sched/topology.h
>>>> +++ b/include/linux/sched/topology.h
>>>> @@ -68,6 +68,10 @@ struct sched_domain_shared {
>>>>    	atomic_t	nr_busy_cpus;
>>>>    	int		has_idle_cores;
>>>>    	int		nr_idle_scan;
>>>> +#ifdef CONFIG_SCHED_CACHE
>>>> +	unsigned long	util_avg;
>>>> +	unsigned long	capacity ____cacheline_aligned_in_smp;
>>>
>>> This cacheline annotation confuses me, see below.
>>>
>>>> +#endif
>>>>    };
>>>>    struct sched_domain {
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index cb82f558dc5b..b9f336300f14 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -9622,6 +9622,29 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
>>>>    	return 0;
>>>>    }
>>>> +#ifdef CONFIG_SCHED_CACHE
>>>> +/* Called from load balancing paths with rcu_read_lock held */
>>>> +static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
>>>> +					 unsigned long *cap)
>>>> +{
>>>> +	struct sched_domain_shared *sd_share;
>>>> +
>>>> +	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>>> +	if (!sd_share)
>>>> +		return false;
>>>> +
>>>> +	*util = READ_ONCE(sd_share->util_avg);
>>>> +	*cap = READ_ONCE(sd_share->capacity);
>>>
>>> You placed capacity on a separate line, forcing the above to be 2
>>> distinct lines. That seems... sub-optimal?
>>>
>>
>> The reason capacity was placed in a separate cache line
>> is that writes to capacity are not very frequent (CPU hotplug
>> should not happen too frequently), while writes to util_avg
>> tend to be relatively frequent (it changes often).
>> If capacity and util_avg were placed in the same cache line,
>> I'm thinking writes to util_avg might invalidate the entire
>> cache line. This could potentially cause cache misses when
>> capacity is read elsewhere, which might lead to false sharing?
> 
> But its introduced here and already read/written together. Is this not
> premature optimization?

I see. Since they are read together, there could be a pre-load of
adjacent cachelines, I suppose. I'll remove this cache alignment and
check whether there is any performance impact.

Thanks,
Chenyu


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
  2025-12-03 23:07 ` [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
  2025-12-09 11:12   ` Peter Zijlstra
  2025-12-10  9:37   ` Peter Zijlstra
@ 2025-12-11  9:03   ` Vern Hao
  2025-12-16  6:12     ` Chen, Yu C
       [not found]   ` <fbf52d91-0605-4608-b9cc-e8cc56115fd5@gmail.com>
  3 siblings, 1 reply; 111+ messages in thread
From: Vern Hao @ 2025-12-11  9:03 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel, Vern Hao

Hi, Peter, Chen Yu and Tim:

On 2025/12/4 07:07, Tim Chen wrote:
> From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>
> Adds infrastructure to enable cache-aware load balancing,
> which improves cache locality by grouping tasks that share resources
> within the same cache domain. This reduces cache misses and improves
> overall data access efficiency.
>
> In this initial implementation, threads belonging to the same process
> are treated as entities that likely share working sets. The mechanism
> tracks per-process CPU occupancy across cache domains and attempts to
> migrate threads toward cache-hot domains where their process already
> has active threads, thereby enhancing locality.
>
> This provides a basic model for cache affinity. While the current code
> targets the last-level cache (LLC), the approach could be extended to
> other domain types such as clusters (L2) or node-internal groupings.
>
> At present, the mechanism selects the CPU within an LLC that has the
> highest recent runtime. Subsequent patches in this series will use this
> information in the load-balancing path to guide task placement toward
> preferred LLCs.
>
> In the future, more advanced policies could be integrated through NUMA
> balancing-for example, migrating a task to its preferred LLC when spare
> capacity exists, or swapping tasks across LLCs to improve cache affinity.
> Grouping of tasks could also be generalized from that of a process
> to be that of a NUMA group, or be user configurable.
>
> Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>
> Notes:
>      v1->v2:
>         Restore the original CPU scan to cover all online CPUs,
>         rather than scanning within the preferred NUMA node.
>         (Peter Zijlstra)
>      
>         Use rq->curr instead of rq->donor. (K Prateek Nayak)
>      
>         Minor fix in task_tick_cache() to use
>         if (mm->mm_sched_epoch >= rq->cpu_epoch)
>         to avoid mm_sched_epoch going backwards.
>
>   include/linux/mm_types.h |  44 +++++++
>   include/linux/sched.h    |  11 ++
>   init/Kconfig             |  11 ++
>   kernel/fork.c            |   6 +
>   kernel/sched/core.c      |   6 +
>   kernel/sched/fair.c      | 258 +++++++++++++++++++++++++++++++++++++++
>   kernel/sched/sched.h     |   8 ++
>   7 files changed, 344 insertions(+)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 90e5790c318f..1ea16ef90566 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -939,6 +939,11 @@ typedef struct {
>   	DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
>   } __private mm_flags_t;
>   
> +struct mm_sched {
> +	u64 runtime;
> +	unsigned long epoch;
> +};
> +
>   struct kioctx_table;
>   struct iommu_mm_data;
>   struct mm_struct {
> @@ -1029,6 +1034,17 @@ struct mm_struct {
>   		 */
>   		raw_spinlock_t cpus_allowed_lock;
>   #endif
> +#ifdef CONFIG_SCHED_CACHE
> +		/*
> +		 * Track per-cpu-per-process occupancy as a proxy for cache residency.
> +		 * See account_mm_sched() and ...
> +		 */
> +		struct mm_sched __percpu *pcpu_sched;
> +		raw_spinlock_t mm_sched_lock;
> +		unsigned long mm_sched_epoch;
> +		int mm_sched_cpu;
As we discussed earlier, I continue to believe that dedicating
'mm_sched_cpu' to the aggregated hotspot of all threads is
inappropriate, as the multiple threads do not necessarily correlate
with each other in our real applications.

So I was wondering if we could put this variable into struct
task_struct instead. That would allow us to better monitor the hotspot
CPU of each thread, although some details would need consideration.

> +#endif
> +
>   #ifdef CONFIG_MMU
>   		atomic_long_t pgtables_bytes;	/* size of all page tables */
>   #endif
> @@ -1487,6 +1503,34 @@ static inline unsigned int mm_cid_size(void)
>   static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
>   #endif /* CONFIG_SCHED_MM_CID */
>   
> +#ifdef CONFIG_SCHED_CACHE
> +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
> +
> +static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
> +{
> +	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
> +
> +	if (!pcpu_sched)
> +		return -ENOMEM;
> +
> +	mm_init_sched(mm, pcpu_sched);
> +	return 0;
> +}
> +
> +#define mm_alloc_sched(...)	alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
> +
> +static inline void mm_destroy_sched(struct mm_struct *mm)
> +{
> +	free_percpu(mm->pcpu_sched);
> +	mm->pcpu_sched = NULL;
> +}
> +#else /* !CONFIG_SCHED_CACHE */
> +
> +static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
> +static inline void mm_destroy_sched(struct mm_struct *mm) { }
> +
> +#endif /* CONFIG_SCHED_CACHE */
> +
>   struct mmu_gather;
>   extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
>   extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b469878de25c..278b529c91df 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1406,6 +1406,10 @@ struct task_struct {
>   	unsigned long			numa_pages_migrated;
>   #endif /* CONFIG_NUMA_BALANCING */
>   
> +#ifdef CONFIG_SCHED_CACHE
> +	struct callback_head		cache_work;
> +#endif
> +
>   #ifdef CONFIG_RSEQ
>   	struct rseq __user *rseq;
>   	u32 rseq_len;
> @@ -2428,4 +2432,11 @@ extern void migrate_enable(void);
>   
>   DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>   
> +#ifdef CONFIG_SCHED_CACHE
> +static inline bool sched_cache_enabled(void)
> +{
> +	return false;
> +}
> +#endif
> +
>   #endif
> diff --git a/init/Kconfig b/init/Kconfig
> index cab3ad28ca49..88556ef8cfd1 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -983,6 +983,17 @@ config NUMA_BALANCING
>   
>   	  This system will be inactive on UMA systems.
>   
> +config SCHED_CACHE
> +	bool "Cache aware load balance"
> +	default y
> +	depends on SMP
> +	help
> +	  When enabled, the scheduler will attempt to aggregate tasks from
> +	  the same process onto a single Last Level Cache (LLC) domain when
> +	  possible. This improves cache locality by keeping tasks that share
> +	  resources within the same cache domain, reducing cache misses and
> +	  lowering data access latency.
> +
>   config NUMA_BALANCING_DEFAULT_ENABLED
>   	bool "Automatically enable NUMA aware memory/task placement"
>   	default y
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3da0f08615a9..aae5053d1e30 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -680,6 +680,7 @@ void __mmdrop(struct mm_struct *mm)
>   	cleanup_lazy_tlbs(mm);
>   
>   	WARN_ON_ONCE(mm == current->active_mm);
> +	mm_destroy_sched(mm);
>   	mm_free_pgd(mm);
>   	mm_free_id(mm);
>   	destroy_context(mm);
> @@ -1083,6 +1084,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>   	if (mm_alloc_cid(mm, p))
>   		goto fail_cid;
>   
> +	if (mm_alloc_sched(mm))
> +		goto fail_sched;
> +
>   	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
>   				     NR_MM_COUNTERS))
>   		goto fail_pcpu;
> @@ -1092,6 +1096,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>   	return mm;
>   
>   fail_pcpu:
> +	mm_destroy_sched(mm);
> +fail_sched:
>   	mm_destroy_cid(mm);
>   fail_cid:
>   	destroy_context(mm);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f754a60de848..e8bdf03a4b7f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4488,6 +4488,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
>   	p->wake_entry.u_flags = CSD_TYPE_TTWU;
>   	p->migration_pending = NULL;
>   	init_sched_mm_cid(p);
> +	init_sched_mm(p);
>   }
>   
>   DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> @@ -8791,6 +8792,11 @@ void __init sched_init(void)
>   
>   		rq->core_cookie = 0UL;
>   #endif
> +#ifdef CONFIG_SCHED_CACHE
> +		raw_spin_lock_init(&rq->cpu_epoch_lock);
> +		rq->cpu_epoch_next = jiffies;
> +#endif
> +
>   		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
>   	}
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5b752324270b..cb82f558dc5b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1152,6 +1152,8 @@ void post_init_entity_util_avg(struct task_struct *p)
>   	sa->runnable_avg = sa->util_avg;
>   }
>   
> +static inline void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec);
> +
>   static s64 update_se(struct rq *rq, struct sched_entity *se)
>   {
>   	u64 now = rq_clock_task(rq);
> @@ -1174,6 +1176,7 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>   
>   		trace_sched_stat_runtime(running, delta_exec);
>   		account_group_exec_runtime(running, delta_exec);
> +		account_mm_sched(rq, running, delta_exec);
>   
>   		/* cgroup time is always accounted against the donor */
>   		cgroup_account_cputime(donor, delta_exec);
> @@ -1193,6 +1196,259 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>   	return delta_exec;
>   }
>   
> +#ifdef CONFIG_SCHED_CACHE
> +
> +/*
> + * XXX numbers come from a place the sun don't shine -- probably wants to be SD
> + * tunable or so.
> + */
> +#define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
> +#define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */
> +
> +static int llc_id(int cpu)
> +{
> +	if (cpu < 0)
> +		return -1;
> +
> +	return per_cpu(sd_llc_id, cpu);
> +}
> +
> +void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
> +{
> +	unsigned long epoch;
> +	int i;
> +
> +	for_each_possible_cpu(i) {
> +		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
> +		struct rq *rq = cpu_rq(i);
> +
> +		pcpu_sched->runtime = 0;
> +		pcpu_sched->epoch = rq->cpu_epoch;
> +		epoch = rq->cpu_epoch;
> +	}
> +
> +	raw_spin_lock_init(&mm->mm_sched_lock);
> +	mm->mm_sched_epoch = epoch;
> +	mm->mm_sched_cpu = -1;
> +
> +	/*
> +	 * The update to mm->pcpu_sched should not be reordered
> +	 * before initialization to mm's other fields, in case
> +	 * the readers may get invalid mm_sched_epoch, etc.
> +	 */
> +	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
> +}
> +
> +/* because why would C be fully specified */
> +static __always_inline void __shr_u64(u64 *val, unsigned int n)
> +{
> +	if (n >= 64) {
> +		*val = 0;
> +		return;
> +	}
> +	*val >>= n;
> +}
> +
> +static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
> +{
> +	lockdep_assert_held(&rq->cpu_epoch_lock);
> +
> +	unsigned long n, now = jiffies;
> +	long delta = now - rq->cpu_epoch_next;
> +
> +	if (delta > 0) {
> +		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
> +		rq->cpu_epoch += n;
> +		rq->cpu_epoch_next += n * EPOCH_PERIOD;
> +		__shr_u64(&rq->cpu_runtime, n);
> +	}
> +
> +	n = rq->cpu_epoch - pcpu_sched->epoch;
> +	if (n) {
> +		pcpu_sched->epoch += n;
> +		__shr_u64(&pcpu_sched->runtime, n);
> +	}
> +}
> +
> +static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
> +{
> +	guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
> +
> +	__update_mm_sched(rq, pcpu_sched);
> +
> +	/*
> +	 * Runtime is a geometric series (r=0.5) and as such will sum to twice
> +	 * the accumulation period, this means the multiplcation here should
> +	 * not overflow.
> +	 */
> +	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
> +}
> +
> +static inline
> +void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> +{
> +	struct mm_struct *mm = p->mm;
> +	struct mm_sched *pcpu_sched;
> +	unsigned long epoch;
> +
> +	if (!sched_cache_enabled())
> +		return;
> +
> +	if (p->sched_class != &fair_sched_class)
> +		return;
> +	/*
> +	 * init_task and kthreads don't having mm
> +	 */
> +	if (!mm || !mm->pcpu_sched)
> +		return;
> +
> +	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
> +
> +	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
> +		__update_mm_sched(rq, pcpu_sched);
> +		pcpu_sched->runtime += delta_exec;
> +		rq->cpu_runtime += delta_exec;
> +		epoch = rq->cpu_epoch;
> +	}
> +
> +	/*
> +	 * If this task hasn't hit task_cache_work() for a while, or it
> +	 * has only 1 thread, invalidate its preferred state.
> +	 */
> +	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
> +	    get_nr_threads(p) <= 1) {
> +		if (mm->mm_sched_cpu != -1)
> +			mm->mm_sched_cpu = -1;
> +	}
> +}
> +
> +static void task_tick_cache(struct rq *rq, struct task_struct *p)
> +{
> +	struct callback_head *work = &p->cache_work;
> +	struct mm_struct *mm = p->mm;
> +
> +	if (!sched_cache_enabled())
> +		return;
> +
> +	if (!mm || !mm->pcpu_sched)
> +		return;
> +
> +	/* avoid moving backwards */
> +	if (mm->mm_sched_epoch >= rq->cpu_epoch)
> +		return;
> +
> +	guard(raw_spinlock)(&mm->mm_sched_lock);
> +
> +	if (work->next == work) {
> +		task_work_add(p, work, TWA_RESUME);
> +		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
> +	}
> +}
> +
> +static void __no_profile task_cache_work(struct callback_head *work)
> +{
> +	struct task_struct *p = current;
> +	struct mm_struct *mm = p->mm;
> +	unsigned long m_a_occ = 0;
> +	unsigned long curr_m_a_occ = 0;
> +	int cpu, m_a_cpu = -1;
> +	cpumask_var_t cpus;
> +
> +	WARN_ON_ONCE(work != &p->cache_work);
> +
> +	work->next = work;
> +
> +	if (p->flags & PF_EXITING)
> +		return;
> +
> +	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
> +		return;
> +
> +	scoped_guard (cpus_read_lock) {
> +		cpumask_copy(cpus, cpu_online_mask);
> +
> +		for_each_cpu(cpu, cpus) {
> +			/* XXX sched_cluster_active */
> +			struct sched_domain *sd = per_cpu(sd_llc, cpu);
> +			unsigned long occ, m_occ = 0, a_occ = 0;
> +			int m_cpu = -1, i;
> +
> +			if (!sd)
> +				continue;
> +
> +			for_each_cpu(i, sched_domain_span(sd)) {
> +				occ = fraction_mm_sched(cpu_rq(i),
> +							per_cpu_ptr(mm->pcpu_sched, i));
> +				a_occ += occ;
> +				if (occ > m_occ) {
> +					m_occ = occ;
> +					m_cpu = i;
> +				}
> +			}
> +
> +			/*
> +			 * Compare the accumulated occupancy of each LLC. The
> +			 * reason for using accumulated occupancy rather than average
> +			 * per CPU occupancy is that it works better in asymmetric LLC
> +			 * scenarios.
> +			 * For example, if there are 2 threads in a 4CPU LLC and 3
> +			 * threads in an 8CPU LLC, it might be better to choose the one
> +			 * with 3 threads. However, this would not be the case if the
> +			 * occupancy is divided by the number of CPUs in an LLC (i.e.,
> +			 * if average per CPU occupancy is used).
> +			 * Besides, NUMA balancing fault statistics behave similarly:
> +			 * the total number of faults per node is compared rather than
> +			 * the average number of faults per CPU. This strategy is also
> +			 * followed here.
> +			 */
> +			if (a_occ > m_a_occ) {
> +				m_a_occ = a_occ;
> +				m_a_cpu = m_cpu;
> +			}
> +
> +			if (llc_id(cpu) == llc_id(mm->mm_sched_cpu))
> +				curr_m_a_occ = a_occ;
> +
> +			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
> +		}
> +	}
> +
> +	if (m_a_occ > (2 * curr_m_a_occ)) {
> +		/*
> +		 * Avoid switching mm_sched_cpu too fast.
> +		 * The reason to choose 2X is because:
> +		 * 1. It is better to keep the preferred LLC stable,
> +		 *    rather than changing it frequently and cause migrations
> +		 * 2. 2X means the new preferred LLC has at least 1 more
> +		 *    busy CPU than the old one(200% vs 100%, eg)
> +		 * 3. 2X is chosen based on test results, as it delivers
> +		 *    the optimal performance gain so far.
> +		 */
> +		mm->mm_sched_cpu = m_a_cpu;
> +	}
> +
> +	free_cpumask_var(cpus);
> +}
> +
> +void init_sched_mm(struct task_struct *p)
> +{
> +	struct callback_head *work = &p->cache_work;
> +
> +	init_task_work(work, task_cache_work);
> +	work->next = work;
> +}
> +
> +#else
> +
> +static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
> +				    s64 delta_exec) { }
> +
> +void init_sched_mm(struct task_struct *p) { }
> +
> +static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
> +
> +#endif
> +
>   /*
>    * Used by other classes to account runtime.
>    */
> @@ -13124,6 +13380,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>   	if (static_branch_unlikely(&sched_numa_balancing))
>   		task_tick_numa(rq, curr);
>   
> +	task_tick_cache(rq, curr);
> +
>   	update_misfit_status(curr, rq);
>   	check_update_overutilized_status(task_rq(curr));
>   
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index adfb6e3409d7..84118b522f22 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1194,6 +1194,12 @@ struct rq {
>   	u64			clock_pelt_idle_copy;
>   	u64			clock_idle_copy;
>   #endif
> +#ifdef CONFIG_SCHED_CACHE
> +	raw_spinlock_t		cpu_epoch_lock ____cacheline_aligned;
> +	u64			cpu_runtime;
> +	unsigned long		cpu_epoch;
> +	unsigned long		cpu_epoch_next;
> +#endif
>   
>   	atomic_t		nr_iowait;
>   
> @@ -3819,6 +3825,8 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
>   static inline void init_sched_mm_cid(struct task_struct *t) { }
>   #endif /* !CONFIG_SCHED_MM_CID */
>   
> +extern void init_sched_mm(struct task_struct *p);
> +
>   extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
>   extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
>   static inline

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-10 18:49     ` Tim Chen
@ 2025-12-11 10:31       ` Peter Zijlstra
  2025-12-15 19:21         ` Tim Chen
  2025-12-16 22:45         ` Tim Chen
  0 siblings, 2 replies; 111+ messages in thread
From: Peter Zijlstra @ 2025-12-11 10:31 UTC (permalink / raw)
  To: Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Wed, Dec 10, 2025 at 10:49:14AM -0800, Tim Chen wrote:
> On Wed, 2025-12-10 at 13:51 +0100, Peter Zijlstra wrote:
> > On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:

> > Would it perhaps be easier to stick this thing in rq->sd rather than in
> > rq->nr_pref_llc. That way it automagically switches with the 'new'
> > domain. And then, with a bit of care, a singe load-balance pass should
> > see a consistent view (there should not be reloads of rq->sd -- which
> > will be a bit of an audit I suppose).
> 
> We need nr_pref_llc information at the runqueue level because the load balancer 
> must identify which specific rq has the largest number of tasks that 
> prefer a given destination LLC. If we move the counter to the LLC’s sd 
> level, we would only know the aggregate number of tasks in the entire LLC 
> that prefer that destination—not which rq they reside on. Without per-rq 
> counts, we would not be able to select the correct source rq to pull tasks from.
> 
> The only way this could work at the LLC-sd level is if all CPUs within 
> the LLC shared a single runqueue, which is not the case today.
> 
> Let me know if I understand your comments correctly.

So the sched_domain instances are per-cpu (hence the need for
sched_domain_shared). So irrespective of what level you stick them at (I
was thinking the bottom most, but it really doesn't matter) they will be
per CPU.


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes
  2025-12-03 23:07 ` [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes Tim Chen
  2025-12-09 12:11   ` Peter Zijlstra
@ 2025-12-12  3:34   ` Vern Hao
  2025-12-15 19:32     ` Tim Chen
  1 sibling, 1 reply; 111+ messages in thread
From: Vern Hao @ 2025-12-12  3:34 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel, Vern Hao


On 2025/12/4 07:07, Tim Chen wrote:
> With cache-aware scheduling enabled, each task is assigned a
> preferred LLC ID. This allows quick identification of the LLC domain
> where the task prefers to run, similar to numa_preferred_nid in
> NUMA balancing.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>
> Notes:
>      v1->v2: Align preferred LLC with NUMA balancing's preferred node.
>
>   include/linux/sched.h |  1 +
>   init/init_task.c      |  3 +++
>   kernel/sched/fair.c   | 18 ++++++++++++++++++
>   3 files changed, 22 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 278b529c91df..1ad46220cd04 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1408,6 +1408,7 @@ struct task_struct {
>   
>   #ifdef CONFIG_SCHED_CACHE
>   	struct callback_head		cache_work;
> +	int				preferred_llc;
>   #endif
>   
>   #ifdef CONFIG_RSEQ
> diff --git a/init/init_task.c b/init/init_task.c
> index a55e2189206f..44bae72b5b7d 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -191,6 +191,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>   	.numa_group	= NULL,
>   	.numa_faults	= NULL,
>   #endif
> +#ifdef CONFIG_SCHED_CACHE
> +	.preferred_llc  = -1,
> +#endif
>   #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
>   	.kasan_depth	= 1,
>   #endif
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a3918269906..10cec83f65d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1300,6 +1300,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   	struct mm_struct *mm = p->mm;
>   	struct mm_sched *pcpu_sched;
>   	unsigned long epoch;
> +	int mm_sched_llc = -1;
>   
>   	if (!sched_cache_enabled())
>   		return;
> @@ -1330,6 +1331,23 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   		if (mm->mm_sched_cpu != -1)
>   			mm->mm_sched_cpu = -1;
>   	}
> +
> +	if (mm->mm_sched_cpu != -1) {
> +		mm_sched_llc = llc_id(mm->mm_sched_cpu);
> +
> +#ifdef CONFIG_NUMA_BALANCING
> +		/*
> +		 * Don't assign preferred LLC if it
> +		 * conflicts with NUMA balancing.
> +		 */
> +		if (p->numa_preferred_nid >= 0 &&

I wonder if the restriction here shouldn't be so strict. In Mel Gorman's
patch (e496132ebedd "sched/fair: Adjust the allowed NUMA imbalance when
SD_NUMA spans multiple LLCs"), the value of 'imb_numa_nr' is checked
to determine whether an SD_NUMA imbalance is allowed. Could we use the
same check to decide whether or not to perform a cross-NUMA migration?
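
For example, something along these lines (purely illustrative and not
from the series; it assumes the imb_numa_nr field that commit
e496132ebedd added to struct sched_domain):

/*
 * Allow a cross-node preferred LLC while the imbalance stays within
 * the NUMA domain's allowed imbalance.
 */
static bool llc_within_numa_imbalance(struct sched_domain *numa_sd,
				      unsigned int nr_cross_node_tasks)
{
	return nr_cross_node_tasks <= numa_sd->imb_numa_nr;
}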

> +		    cpu_to_node(mm->mm_sched_cpu) != p->numa_preferred_nid)
> +			mm_sched_llc = -1;
> +#endif
> +	}
> +
> +	if (p->preferred_llc != mm_sched_llc)
> +		p->preferred_llc = mm_sched_llc;
>   }
>   
>   static void task_tick_cache(struct rq *rq, struct task_struct *p)

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-11 10:31       ` Peter Zijlstra
@ 2025-12-15 19:21         ` Tim Chen
  2025-12-16 22:45         ` Tim Chen
  1 sibling, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-15 19:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Thu, 2025-12-11 at 11:31 +0100, Peter Zijlstra wrote:
> On Wed, Dec 10, 2025 at 10:49:14AM -0800, Tim Chen wrote:
> > On Wed, 2025-12-10 at 13:51 +0100, Peter Zijlstra wrote:
> > > On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:
> 
> > > Would it perhaps be easier to stick this thing in rq->sd rather than in
> > > rq->nr_pref_llc. That way it automagically switches with the 'new'
> > > domain. And then, with a bit of care, a singe load-balance pass should
> > > see a consistent view (there should not be reloads of rq->sd -- which
> > > will be a bit of an audit I suppose).
> > 
> > We need nr_pref_llc information at the runqueue level because the load balancer 
> > must identify which specific rq has the largest number of tasks that 
> > prefer a given destination LLC. If we move the counter to the LLC’s sd 
> > level, we would only know the aggregate number of tasks in the entire LLC 
> > that prefer that destination—not which rq they reside on. Without per-rq 
> > counts, we would not be able to select the correct source rq to pull tasks from.
> > 
> > The only way this could work at the LLC-sd level is if all CPUs within 
> > the LLC shared a single runqueue, which is not the case today.
> > 
> > Let me know if I understand your comments correctly.
> 
> So the sched_domain instances are per-cpu (hence the need for
> sched_domain_shared). So irrespective of what level you stick them at (I
> was thinking the bottom most, but it really doesn't matter) they will be
> per CPU.
> 

Okay, I see what you're saying.  Will update code accordingly.

Tim

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes
  2025-12-12  3:34   ` Vern Hao
@ 2025-12-15 19:32     ` Tim Chen
  2025-12-19  4:01       ` Vern Hao
  0 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-15 19:32 UTC (permalink / raw)
  To: Vern Hao, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Fri, 2025-12-12 at 11:34 +0800, Vern Hao wrote:
> On 2025/12/4 07:07, Tim Chen wrote:
> > With cache-aware scheduling enabled, each task is assigned a
> > preferred LLC ID. This allows quick identification of the LLC domain
> > where the task prefers to run, similar to numa_preferred_nid in
> > NUMA balancing.
> > 
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> > 
> > Notes:
> >      v1->v2: Align preferred LLC with NUMA balancing's preferred node.
> > 
> >   include/linux/sched.h |  1 +
> >   init/init_task.c      |  3 +++
> >   kernel/sched/fair.c   | 18 ++++++++++++++++++
> >   3 files changed, 22 insertions(+)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 278b529c91df..1ad46220cd04 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1408,6 +1408,7 @@ struct task_struct {
> >   
> >   #ifdef CONFIG_SCHED_CACHE
> >   	struct callback_head		cache_work;
> > +	int				preferred_llc;
> >   #endif
> >   
> >   #ifdef CONFIG_RSEQ
> > diff --git a/init/init_task.c b/init/init_task.c
> > index a55e2189206f..44bae72b5b7d 100644
> > --- a/init/init_task.c
> > +++ b/init/init_task.c
> > @@ -191,6 +191,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
> >   	.numa_group	= NULL,
> >   	.numa_faults	= NULL,
> >   #endif
> > +#ifdef CONFIG_SCHED_CACHE
> > +	.preferred_llc  = -1,
> > +#endif
> >   #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> >   	.kasan_depth	= 1,
> >   #endif
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0a3918269906..10cec83f65d5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1300,6 +1300,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> >   	struct mm_struct *mm = p->mm;
> >   	struct mm_sched *pcpu_sched;
> >   	unsigned long epoch;
> > +	int mm_sched_llc = -1;
> >   
> >   	if (!sched_cache_enabled())
> >   		return;
> > @@ -1330,6 +1331,23 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> >   		if (mm->mm_sched_cpu != -1)
> >   			mm->mm_sched_cpu = -1;
> >   	}
> > +
> > +	if (mm->mm_sched_cpu != -1) {
> > +		mm_sched_llc = llc_id(mm->mm_sched_cpu);
> > +
> > +#ifdef CONFIG_NUMA_BALANCING
> > +		/*
> > +		 * Don't assign preferred LLC if it
> > +		 * conflicts with NUMA balancing.
> > +		 */
> > +		if (p->numa_preferred_nid >= 0 &&
> 
> I wonder if the restriction here shouldn't be so strict. In Mel Gorman's
> patch (e496132ebedd "sched/fair: Adjust the allowed NUMA imbalance when
> SD_NUMA spans multiple LLCs"), the value of 'imb_numa_nr' is checked
> to determine whether an SD_NUMA imbalance is allowed. Could we use the
> same check to decide whether or not to perform a cross-NUMA migration?

If we set a preferred LLC that's in a node other than the preferred
node, the preferred LLC is going to fight with NUMA balancing and bounce
tasks back and forth between nodes. NUMA locality is going to affect
performance more, so we'll let the NUMA preference take precedence.

Tim 

> 
> > +		    cpu_to_node(mm->mm_sched_cpu) != p->numa_preferred_nid)
> > +			mm_sched_llc = -1;
> > +#endif
> > +	}
> > +
> > +	if (p->preferred_llc != mm_sched_llc)
> > +		p->preferred_llc = mm_sched_llc;
> >   }
> >   
> >   static void task_tick_cache(struct rq *rq, struct task_struct *p)

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-09 11:58   ` Peter Zijlstra
@ 2025-12-15 20:49     ` Tim Chen
  2025-12-16  5:31       ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-15 20:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Tue, 2025-12-09 at 12:58 +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:23PM -0800, Tim Chen wrote:
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 710ed9943d27..0a3918269906 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1210,10 +1210,17 @@ __read_mostly unsigned int llc_imb_pct            = 20;
> >  
> >  static int llc_id(int cpu)
> >  {
> > +	int llc;
> > +
> >  	if (cpu < 0)
> >  		return -1;
> >  
> > +	llc = per_cpu(sd_llc_id, cpu);
> > +	/* avoid race with cpu hotplug */
> > +	if (unlikely(llc >= max_llcs))
> > +		return -1;
> > +
> > +	return llc;
> >  }
> >  
> >  void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
> 
> > @@ -668,6 +670,55 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> >  DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
> >  DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
> >  
> > +/*
> > + * Assign continuous llc id for the CPU, and return
> > + * the assigned llc id.
> > + */
> > +static int update_llc_id(struct sched_domain *sd,
> > +			 int cpu)
> > +{
> > +	int id = per_cpu(sd_llc_id, cpu), i;
> > +
> > +	if (id >= 0)
> > +		return id;
> > +
> > +	if (sd) {
> > +		/* Look for any assigned id and reuse it.*/
> > +		for_each_cpu(i, sched_domain_span(sd)) {
> > +			id = per_cpu(sd_llc_id, i);
> > +
> > +			if (id >= 0) {
> > +				per_cpu(sd_llc_id, cpu) = id;
> > +				return id;
> > +			}
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * When 1. there is no id assigned to this LLC domain,
> > +	 * or 2. the sd is NULL, we reach here.
> > +	 * Consider the following scenario,
> > +	 * CPU0~CPU95 are in the node0, CPU96~CPU191 are
> > +	 * in the node1. During bootup, maxcpus=96 is
> > +	 * appended.
> > +	 * case 1: When running cpu_attach_domain(CPU24)
> > +	 * during boot up, CPU24 is the first CPU in its
> > +	 * non-NULL LLC domain. However,
> > +	 * its corresponding llc id has not been assigned yet.
> > +	 *
> > +	 * case 2: After boot up, the CPU100 is brought up
> > +	 * via sysfs manually. As a result, CPU100 has only a
> > +	 * Numa domain attached, because CPU100 is the only CPU
> > +	 * of a sched domain, all its bottom domains are degenerated.
> > +	 * The LLC domain pointer sd is NULL for CPU100.
> > +	 *
> > +	 * For both cases, we want to increase the number of LLCs.
> > +	 */
> > +	per_cpu(sd_llc_id, cpu) = max_llcs++;
> > +
> > +	return per_cpu(sd_llc_id, cpu);
> > +}
> 
> I'm not sure I follow. So partition_sched_domains() first calls
> detach_destroy_domains() on the old set, and then build_sched_domains()
> on the new set.
> 
> So detach_destroy_domain() will do:
> 
>   cpu_attach_domain(NULL,..);
> 
> That is, it will explicitly attach the NULL sched_domain to a CPU. At
> which point I feel update_llc_id() should be returning -1, no?
> 
> Then later, build_sched_domains() will set a !NULL sched_domain, at
> which point update_llc_id() can set a real value.
> 
> This should then also get rid of that weird max_llcs check in llc_id(),
> right?

Thanks for pointing this out.  Yes, we should take care of the
attachment of NULL sd. Will update the code accordingly.

Tim

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue
  2025-12-10  9:42       ` Peter Zijlstra
@ 2025-12-16  0:20         ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-16  0:20 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/10/2025 5:42 PM, Peter Zijlstra wrote:
> On Tue, Dec 09, 2025 at 02:55:21PM -0800, Tim Chen wrote:
> 
>>>> +static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
>>>> +{
>>>> +	int pref_llc;
>>>> +
>>>> +	/*
>>>> +	 * Borrow the uc_se->active from uclamp_rq_inc_id(),
>>>> +	 * uclamp_rq_dec_id() to avoid the unbalanced calculation
>>>> +	 * of rq statistics.
>>>> +	 */
>>>> +	if (unlikely(!p->sched_llc_active))
>>>> +		return;
>>>
>>> Another very confusing comment; what? Also, can you please explain (in
>>> the new comment) how we get here without having llc_active set?
>>
>> The comment meant to say that we are using a similar mechanism as
>> accounting done in uc_se->active from uclamp_rq_inc_id(). I agree that
>> it confuses more than making things clearer.
>>
>> How about the following comment to make things clearer:
>>
>> 	/*
>> 	 * Cache aware scheduling was active when the task was enqueued.
>> 	 * Admin has disabled cache aware scheduling before task was dequeued
>> 	 * but the accounting has to be kept straight in case cache aware scheduling
>> 	 * is re-enabled.
>> 	 */
> 
> Is having that sched_cache_enabled() test worth it?
> account_numa_{en,de}queue() don't seem to have any of this.
> 
> 

OK, I think we can remove the sched_cache_enabled() check and
make account_llc_{en,de}queue() depend on CONFIG_SCHED_CACHE,
so that sched_llc_active can be removed.
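
A minimal sketch of that direction (field and array names follow the v2
patches; this is not the final code, and it leaves aside the question of
p->preferred_llc changing while a task is enqueued):

#ifdef CONFIG_SCHED_CACHE
static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
{
	/* Only count tasks that currently have an LLC preference. */
	if (p->preferred_llc < 0)
		return;

	rq->nr_pref_llc[p->preferred_llc]++;
}

static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
{
	if (p->preferred_llc < 0)
		return;

	rq->nr_pref_llc[p->preferred_llc]--;
}
#else
static inline void account_llc_enqueue(struct rq *rq, struct task_struct *p) { }
static inline void account_llc_dequeue(struct rq *rq, struct task_struct *p) { }
#endif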

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 12/23] sched/cache: Add migrate_llc_task migration type for cache-aware balancing
  2025-12-10 13:32   ` Peter Zijlstra
@ 2025-12-16  0:52     ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-16  0:52 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/10/2025 9:32 PM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:31PM -0800, Tim Chen wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index db555c11b5b8..529adf342ce0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9547,7 +9547,8 @@ enum migration_type {
>>   	migrate_load = 0,
>>   	migrate_util,
>>   	migrate_task,
>> -	migrate_misfit
>> +	migrate_misfit,
>> +	migrate_llc_task
>>   };
>>   
>>   #define LBF_ALL_PINNED	0x01
>> @@ -10134,6 +10135,10 @@ static int detach_tasks(struct lb_env *env)
>>   			env->imbalance -= util;
>>   			break;
>>   
>> +		case migrate_llc_task:
>> +			env->imbalance--;
>> +			break;
>> +
>>   		case migrate_task:
>>   			env->imbalance--;
>>   			break;
> 
>> @@ -12181,6 +12199,16 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>>   			}
>>   			break;
>>   
>> +		case migrate_llc_task:
>> +#ifdef CONFIG_SCHED_CACHE
>> +			dst_llc = llc_id(env->dst_cpu);
>> +			if (dst_llc >= 0 &&
>> +			    busiest_pref_llc < rq->nr_pref_llc[dst_llc]) {
>> +				busiest_pref_llc = rq->nr_pref_llc[dst_llc];
>> +				busiest = rq;
>> +			}
>> +#endif
>> +			break;
>>   		case migrate_task:
>>   			if (busiest_nr < nr_running) {
>>   				busiest_nr = nr_running;
>> @@ -12363,6 +12391,8 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
>>   	case migrate_misfit:
>>   		__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
>>   		break;
>> +	case migrate_llc_task:
>> +		break;
>>   	}
>>   }
> 
> The enum and all switch statements had the same order; you wrecked it!

OK, will fix the order.

Thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-15 20:49     ` Tim Chen
@ 2025-12-16  5:31       ` Chen, Yu C
  2025-12-16 19:53         ` Tim Chen
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-16  5:31 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/16/2025 4:49 AM, Tim Chen wrote:
> On Tue, 2025-12-09 at 12:58 +0100, Peter Zijlstra wrote:
>> On Wed, Dec 03, 2025 at 03:07:23PM -0800, Tim Chen wrote:
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 710ed9943d27..0a3918269906 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1210,10 +1210,17 @@ __read_mostly unsigned int llc_imb_pct            = 20;
>>>   
>>>   static int llc_id(int cpu)
>>>   {
>>> +	int llc;
>>> +
>>>   	if (cpu < 0)
>>>   		return -1;
>>>   
>>> +	llc = per_cpu(sd_llc_id, cpu);
>>> +	/* avoid race with cpu hotplug */
>>> +	if (unlikely(llc >= max_llcs))
>>> +		return -1;
>>> +
>>> +	return llc;
>>>   }
>>>   
>>>   void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
>>
>>> @@ -668,6 +670,55 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>>   DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
>>>   DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
>>>   
>>> +/*
>>> + * Assign continuous llc id for the CPU, and return
>>> + * the assigned llc id.
>>> + */
>>> +static int update_llc_id(struct sched_domain *sd,
>>> +			 int cpu)
>>> +{
>>> +	int id = per_cpu(sd_llc_id, cpu), i;
>>> +
>>> +	if (id >= 0)
>>> +		return id;
>>> +
>>> +	if (sd) {
>>> +		/* Look for any assigned id and reuse it.*/
>>> +		for_each_cpu(i, sched_domain_span(sd)) {
>>> +			id = per_cpu(sd_llc_id, i);
>>> +
>>> +			if (id >= 0) {
>>> +				per_cpu(sd_llc_id, cpu) = id;
>>> +				return id;
>>> +			}
>>> +		}
>>> +	}
>>> +
>>> +	/*
>>> +	 * When 1. there is no id assigned to this LLC domain,
>>> +	 * or 2. the sd is NULL, we reach here.
>>> +	 * Consider the following scenario,
>>> +	 * CPU0~CPU95 are in the node0, CPU96~CPU191 are
>>> +	 * in the node1. During bootup, maxcpus=96 is
>>> +	 * appended.
>>> +	 * case 1: When running cpu_attach_domain(CPU24)
>>> +	 * during boot up, CPU24 is the first CPU in its
>>> +	 * non-NULL LLC domain. However,
>>> +	 * its corresponding llc id has not been assigned yet.
>>> +	 *
>>> +	 * case 2: After boot up, the CPU100 is brought up
>>> +	 * via sysfs manually. As a result, CPU100 has only a
>>> +	 * Numa domain attached, because CPU100 is the only CPU
>>> +	 * of a sched domain, all its bottom domains are degenerated.
>>> +	 * The LLC domain pointer sd is NULL for CPU100.
>>> +	 *
>>> +	 * For both cases, we want to increase the number of LLCs.
>>> +	 */
>>> +	per_cpu(sd_llc_id, cpu) = max_llcs++;
>>> +
>>> +	return per_cpu(sd_llc_id, cpu);
>>> +}
>>
>> I'm not sure I follow. So partition_sched_domains() first calls
>> detach_destroy_domains() on the old set, and then build_sched_domains()
>> on the new set.
>>
>> So detach_destroy_domain() will do:
>>
>>    cpu_attach_domain(NULL,..);
>>
>> That is, it will explicitly attach the NULL sched_domain to a CPU. At
>> which point I feel update_llc_id() should be returning -1, no?
>>
>> Then later, build_sched_domains() will set a !NULL sched_domain, at
>> which point update_llc_id() can set a real value.
>>
>> This should then also get rid of that weird max_llcs check in llc_id(),
>> right?

The check for max_llcs was intended to prevent out-of-bounds access
to rq->nr_pref_llc[] at multiple points in the code.
dst_llc = llc_id(env->dst_cpu) is used to index that array, and while the
LLC ID for the CPU is updated in update_llc_id(), this update occurs before
we reallocate the nr_pref_llc buffer, so dst_llc may end up exceeding the
bounds of the original nr_pref_llc buffer.

For this reason, we added the max_llcs check in llc_id() before
rq->nr_pref_llc[dst_llc] is accessed.

However, I agree that the max_llcs check is not properly integrated
into the current patch: it should instead be placed in the 7th patch, as
that would better illustrate the rationale for the check here:
sched/cache: Introduce per runqueue task LLC preference counter

In the 7th patch, we actually increment new_max_llcs rather than
max_llcs — meaning max_llcs always represents the "old" number of LLCs.
As a result, there is a race window between extending the rq->nr_pref_llc
buffer and updating max_llcs.


@@ -714,7 +827,7 @@ static int update_llc_id(struct sched_domain *sd,
  	 *
  	 * For both cases, we want to increase the number of LLCs.
  	 */
-	per_cpu(sd_llc_id, cpu) = max_llcs++;
+	per_cpu(sd_llc_id, cpu) = new_max_llcs++;

  	return per_cpu(sd_llc_id, cpu);
  }


> Thanks for pointing this out.  Yes, we should take care of the
> attachment of NULL sd. Will update the code accordingly.
> 

My understanding is that, if the sd is NULL, it is either because we were
invoked by detach_destroy_domain() for the old set, or because of case 2
mentioned in the comments above:
Say, CPU0-CPU95 are online during bootup and the boot command line has
maxcpus=96.
Later, after bootup, the user wants to bring up CPU100. The LLC domain for
CPU100 is NULL in this case (due to sd degeneration), and a new LLC should
be detected.

That is to say, when we reach update_llc_id(), there could be 2 reasons
for NULL sd. For the detach_destroy_domain() case, update_llc_id()
should return a valid id without increasing the max_llcs, because of
     if (id >= 0)
         return id;
And for the latter, the max_llcs should be increased.
Let me double check on this.

thanks,
Chenyu


> Tim

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
  2025-12-11  9:03   ` Vern Hao
@ 2025-12-16  6:12     ` Chen, Yu C
  2025-12-17  1:17       ` Vern Hao
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-16  6:12 UTC (permalink / raw)
  To: Vern Hao
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Peter Zijlstra,
	Ingo Molnar, K Prateek Nayak, Vincent Guittot, Gautham R . Shenoy,
	Tim Chen

On 12/11/2025 5:03 PM, Vern Hao wrote:
> Hi, Peter, Chen Yu and Tim:
> 
> On 2025/12/4 07:07, Tim Chen wrote:
>> From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>>
>> Adds infrastructure to enable cache-aware load balancing,
>> which improves cache locality by grouping tasks that share resources
>> within the same cache domain. This reduces cache misses and improves
>> overall data access efficiency.
>>
>> In this initial implementation, threads belonging to the same process
>> are treated as entities that likely share working sets. The mechanism
>> tracks per-process CPU occupancy across cache domains and attempts to
>> migrate threads toward cache-hot domains where their process already
>> has active threads, thereby enhancing locality.
>>
>> This provides a basic model for cache affinity. While the current code
>> targets the last-level cache (LLC), the approach could be extended to
>> other domain types such as clusters (L2) or node-internal groupings.
>>
>> At present, the mechanism selects the CPU within an LLC that has the
>> highest recent runtime. Subsequent patches in this series will use this
>> information in the load-balancing path to guide task placement toward
>> preferred LLCs.
>>
>> In the future, more advanced policies could be integrated through NUMA
>> balancing-for example, migrating a task to its preferred LLC when spare
>> capacity exists, or swapping tasks across LLCs to improve cache affinity.
>> Grouping of tasks could also be generalized from that of a process
>> to be that of a NUMA group, or be user configurable.
>>
>> Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>
>> Notes:
>>      v1->v2:
>>         Restore the original CPU scan to cover all online CPUs,
>>         rather than scanning within the preferred NUMA node.
>>         (Peter Zijlstra)
>>         Use rq->curr instead of rq->donor. (K Prateek Nayak)
>>         Minor fix in task_tick_cache() to use
>>         if (mm->mm_sched_epoch >= rq->cpu_epoch)
>>         to avoid mm_sched_epoch going backwards.
>>
>>   include/linux/mm_types.h |  44 +++++++
>>   include/linux/sched.h    |  11 ++
>>   init/Kconfig             |  11 ++
>>   kernel/fork.c            |   6 +
>>   kernel/sched/core.c      |   6 +
>>   kernel/sched/fair.c      | 258 +++++++++++++++++++++++++++++++++++++++
>>   kernel/sched/sched.h     |   8 ++
>>   7 files changed, 344 insertions(+)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 90e5790c318f..1ea16ef90566 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -939,6 +939,11 @@ typedef struct {
>>       DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
>>   } __private mm_flags_t;
>> +struct mm_sched {
>> +    u64 runtime;
>> +    unsigned long epoch;
>> +};
>> +
>>   struct kioctx_table;
>>   struct iommu_mm_data;
>>   struct mm_struct {
>> @@ -1029,6 +1034,17 @@ struct mm_struct {
>>            */
>>           raw_spinlock_t cpus_allowed_lock;
>>   #endif
>> +#ifdef CONFIG_SCHED_CACHE
>> +        /*
>> +         * Track per-cpu-per-process occupancy as a proxy for cache 
>> residency.
>> +         * See account_mm_sched() and ...
>> +         */
>> +        struct mm_sched __percpu *pcpu_sched;
>> +        raw_spinlock_t mm_sched_lock;
>> +        unsigned long mm_sched_epoch;
>> +        int mm_sched_cpu;
> As we discussed earlier, I continue to believe that dedicating
> 'mm_sched_cpu' to handle the aggregated hotspots of all threads is
> inappropriate, as the multiple threads lack a necessary correlation in
> our real application.
> 
> So, I was wondering if we could put this variable into struct
> task_struct. That allows us to better monitor the hotspot CPU of each
> thread, despite some details needing consideration.
> 

I suppose you are suggesting fine-grained control for a set of tasks.
Process-scope aggregation could be a start as the default strategy
(conservative, benefits multi-threaded workloads that share data per
process, and does not introduce regressions).

On top of that, I wonder if we could provide task-scope control like
sched_setattr(), similar to the core-scheduling cookie mechanism, for
users that want aggressive aggregation. But before doing that, we need a
mechanism that leverages a monitoring facility (like the PMU) to figure out
whether putting these tasks together would bring benefit (if I understand
Steven's suggestion at LPC correctly), or to detect tasks that share
resources, and then maybe leverage QoS interfaces to enable cache-aware
aggregation (something Qias mentioned at LPC).

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach
  2025-12-10 16:30   ` Peter Zijlstra
@ 2025-12-16  7:30     ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-16  7:30 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/11/2025 12:30 AM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:34PM -0800, Tim Chen wrote:
> 
>> @@ -10025,6 +10025,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>   	if (env->flags & LBF_ACTIVE_LB)
>>   		return 1;
>>   
>> +#ifdef CONFIG_SCHED_CACHE
>> +	if (sched_cache_enabled() &&
>> +	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid &&
>> +	    !task_has_sched_core(p))
>> +		return 0;
>> +#endif
> 
> This seems wrong:
>   - it does not let nr_balance_failed override things;
>   - it takes precedence over migrate_degrade_locality(); you really want
>     to migrate towards the preferred NUMA node over staying on your LLC.
> 
> That is, this really wants to be done after migrate_degrades_locality()
> and only if degrades == 0 or something.
> 

OK, will fix it.

>>   	degrades = migrate_degrades_locality(p, env);
>>   	if (!degrades)
>>   		hot = task_hot(p, env);
>> @@ -10146,12 +10153,55 @@ static struct list_head
>>   	list_splice(&pref_old_llc, tasks);
>>   	return tasks;
>>   }
>> +
>> +static bool stop_migrate_src_rq(struct task_struct *p,
>> +				struct lb_env *env,
>> +				int detached)
>> +{
>> +	if (!sched_cache_enabled() || p->preferred_llc == -1 ||
>> +	    cpus_share_cache(env->src_cpu, env->dst_cpu) ||
>> +	    env->sd->nr_balance_failed)
>> +		return false;
> 
> But you are allowing nr_balance_failed to override things here.
> 
>> +	/*
>> +	 * Stop migration for the src_rq and pull from a
>> +	 * different busy runqueue in the following cases:
>> +	 *
>> +	 * 1. Trying to migrate task to its preferred
>> +	 *    LLC, but the chosen task does not prefer dest
>> +	 *    LLC - case 3 in order_tasks_by_llc(). This violates
>> +	 *    the goal of migrate_llc_task. However, we should
>> +	 *    stop detaching only if some tasks have been detached
>> +	 *    and the imbalance has been mitigated.
>> +	 *
>> +	 * 2. Don't detach more tasks if the remaining tasks want
>> +	 *    to stay. We know the remaining tasks all prefer the
>> +	 *    current LLC, because after order_tasks_by_llc(), the
>> +	 *    tasks that prefer the current LLC are the least favored
>> +	 *    candidates to be migrated out.
>> +	 */
>> +	if (env->migration_type == migrate_llc_task &&
>> +	    detached && llc_id(env->dst_cpu) != p->preferred_llc)
>> +		return true;
>> +
>> +	if (llc_id(env->src_cpu) == p->preferred_llc)
>> +		return true;
>> +
>> +	return false;
>> +}
> 
> Also, I think we have a problem with nr_balance_failed, cache_nice_tries
> is 1 for SHARE_LLC; this means for failed=0 we ignore:
> 
>   - ineligible tasks
>   - llc fail
>   - node-degrading / hot
> 
> and then the very next round, we do all of them at once, without much
> grading.
> 

Do you mean we can set different thresholds for the different
scenarios you mentioned above, so as to avoid migrating tasks
at the same time in detach_tasks()?

For example,

ineligible tasks check:
if (env->sd->nr_balance_failed > env->sd->cache_nice_tries)
     can_migrate;

llc fail check:
if (env->sd->nr_balance_failed > env->sd->cache_nice_tries + 1)
     can_migrate;

node-degrading/hot check:
if (env->sd->nr_balance_failed > env->sd->cache_nice_tries + 2)
     can_migrate;


>> @@ -10205,6 +10255,15 @@ static int detach_tasks(struct lb_env *env)
>>   
>>   		p = list_last_entry(tasks, struct task_struct, se.group_node);
>>   
>> +		/*
>> +		 * Check if detaching current src_rq should be stopped, because
>> +		 * doing so would break cache aware load balance. If we stop
>> +		 * here, the env->flags has LBF_ALL_PINNED, which would cause
>> +		 * the load balance to pull from another busy runqueue.
> 
> Uhh, can_migrate_task() will clear that ALL_PINNED thing if we've found
> at least one task before getting here.
> 

One problem is that LBF_ALL_PINNED is cleared before
migrate_degrades_locality()/can_migrate_llc_task() in detach_tasks().
I suppose we want to keep LBF_ALL_PINNED set if can_migrate_llc_task()
fails (i.e. the migration would break LLC locality).

>> +		 */
>> +		if (stop_migrate_src_rq(p, env, detached))
>> +			break;
> 
> 
> Perhaps split cfs_tasks into multiple lists from the get-go? That avoids
> this sorting.

Will check with Tim on this.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node
  2025-12-10 16:32   ` Peter Zijlstra
  2025-12-10 16:52     ` Peter Zijlstra
@ 2025-12-16  7:31     ` Chen, Yu C
  1 sibling, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-16  7:31 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel, Libo Chen

On 12/11/2025 12:32 AM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:35PM -0800, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> Cache-aware load balancing should only be enabled if there are more
>> than 1 LLCs within 1 NUMA node. sched_cache_present is introduced to
>> indicate whether this platform supports this topology.
>>
>> Suggested-by: Libo Chen <libo.chen@oracle.com>
>> Suggested-by: Adam Li <adamli@os.amperecomputing.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>
>> Notes:
>>      v1->v2:
>>      	Use flag sched_cache_present to indicate whether a platform
>>      	supports cache aware scheduling. Change this flag from static key.
>>      	There should be only 1 static key to control the cache aware
>>      	scheduling. (Peter Zijlstra)
>>
>>   kernel/sched/topology.c | 20 +++++++++++++++-----
>>   1 file changed, 15 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index d583399fc6a1..9799e3a9a609 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -24,6 +24,8 @@ int max_llcs;
>>   
>>   #ifdef CONFIG_SCHED_CACHE
>>   
>> +static bool sched_cache_present;
> 
> sched_energy_present
> sched_asym_cpucapacity
> sched_cluster_active
> sched_smt_present
> 
> are all static keys tied to the current topology, why break the streak
> and make this a boolean?

OK, will convert it into a key.
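
A minimal sketch of the conversion (keeping the same symbol name, and
ignoring whether the cpuslocked static-branch variants are needed in the
topology-rebuild path):

DEFINE_STATIC_KEY_FALSE(sched_cache_present);

/* called from the topology code once the LLC layout of a node is known */
static void set_sched_cache_present(bool multi_llc_in_node)
{
	if (multi_llc_in_node)
		static_branch_enable(&sched_cache_present);
	else
		static_branch_disable(&sched_cache_present);
}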

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node
  2025-12-10 16:52     ` Peter Zijlstra
@ 2025-12-16  7:36       ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-16  7:36 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel, Libo Chen

On 12/11/2025 12:52 AM, Peter Zijlstra wrote:
> On Wed, Dec 10, 2025 at 05:32:35PM +0100, Peter Zijlstra wrote:
>> On Wed, Dec 03, 2025 at 03:07:35PM -0800, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@intel.com>
>>>
>>> Cache-aware load balancing should only be enabled if there are more
>>> than 1 LLCs within 1 NUMA node. sched_cache_present is introduced to
>>> indicate whether this platform supports this topology.
>>>
>>> Suggested-by: Libo Chen <libo.chen@oracle.com>
>>> Suggested-by: Adam Li <adamli@os.amperecomputing.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> ---
>>>
>>> Notes:
>>>      v1->v2:
>>>      	Use flag sched_cache_present to indicate whether a platform
>>>      	supports cache aware scheduling. Change this flag from static key.
>>>      	There should be only 1 static key to control the cache aware
>>>      	scheduling. (Peter Zijlstra)
>>>
>>>   kernel/sched/topology.c | 20 +++++++++++++++-----
>>>   1 file changed, 15 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index d583399fc6a1..9799e3a9a609 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -24,6 +24,8 @@ int max_llcs;
>>>   
>>>   #ifdef CONFIG_SCHED_CACHE
>>>   
>>> +static bool sched_cache_present;
>>
>> sched_energy_present
>> sched_asym_cpucapacity
>> sched_cluster_active
>> sched_smt_present
>>
>> are all static keys tied to the current topology, why break the streak
>> and make this a boolean?
> 
> Also, patch doesn't use sched_cache_present at all, so perhaps just drop
> it on the floor entirely?

The sched_cache_present flag is used in a subsequent patch:
"[20/23] sched/cache: Add user control to adjust the parameters of
cache-aware scheduling". This flag is used to check whether the user is
eligible to enable cache-aware scheduling. I will try to place the
declaration and usage of this flag together.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
  2025-12-10 16:51   ` Peter Zijlstra
@ 2025-12-16  7:40     ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-16  7:40 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/11/2025 12:51 AM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> A performance regression was observed by Prateek when running hackbench
>> with many threads per process (high fd count). To avoid this, processes
>> with a large number of active threads are excluded from cache-aware
>> scheduling.
>>
>> With sched_cache enabled, record the number of active threads in each
>> process during the periodic task_cache_work(). While iterating over
>> CPUs, if the currently running task belongs to the same process as the
>> task that launched task_cache_work(), increment the active thread count.
>>
>> This number will be used by subsequent patch to inhibit cache aware
>> load balance.
>>
>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>
>> Notes:
>>      v1->v2: No change.
>>
>>   include/linux/mm_types.h |  1 +
>>   kernel/sched/fair.c      | 11 +++++++++--
>>   2 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 1ea16ef90566..04743983de4d 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1043,6 +1043,7 @@ struct mm_struct {
>>   		raw_spinlock_t mm_sched_lock;
>>   		unsigned long mm_sched_epoch;
>>   		int mm_sched_cpu;
>> +		u64 nr_running_avg ____cacheline_aligned_in_smp;
> 
> This is unlikely to do what you hope it does, it will place this
> variable on a new cacheline, but will not ensure this variable is the
> only one in that line. Notably pgtables_bytes (the next field in this
> structure) will share the line.
> 
> It might all be less dodgy if you stick these here fields in their own
> structure, a little like mm_mm_cid or so.
> 

Got it, will do.
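
A minimal sketch of that layout (the struct and field names here are only
illustrative, not from the final patch):

/* somewhere in include/linux/mm_types.h */
struct mm_sched_state {
	struct mm_sched __percpu	*pcpu_sched;
	raw_spinlock_t			lock;
	unsigned long			epoch;
	int				sched_cpu;
	u64				nr_running_avg;
} ____cacheline_aligned_in_smp;

and in struct mm_struct:

#ifdef CONFIG_SCHED_CACHE
	struct mm_sched_state		mm_sched;
#endif

With the alignment attribute on the type itself, the structure both starts
on its own cache line and is padded out to a multiple of the cache line,
so the following field does not share the line.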

>>   #endif
>>   
>>   #ifdef CONFIG_MMU
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 580a967efdac..2f38ad82688f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
>>   
>>   static void __no_profile task_cache_work(struct callback_head *work)
>>   {
>> -	struct task_struct *p = current;
>> +	struct task_struct *p = current, *cur;
>>   	struct mm_struct *mm = p->mm;
>>   	unsigned long m_a_occ = 0;
>>   	unsigned long curr_m_a_occ = 0;
>> -	int cpu, m_a_cpu = -1;
>> +	int cpu, m_a_cpu = -1, nr_running = 0;
>>   	cpumask_var_t cpus;
>>   
>>   	WARN_ON_ONCE(work != &p->cache_work);
>> @@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct callback_head *work)
>>   					m_occ = occ;
>>   					m_cpu = i;
>>   				}
> 
> 	guard(rcu)();
> 

OK.
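
A minimal sketch of the loop body using the scoped form of the guard
(assuming <linux/cleanup.h> is already pulled in here):

		scoped_guard(rcu) {
			cur = rcu_dereference(cpu_rq(i)->curr);
			if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
			    cur->mm == mm)
				nr_running++;
		}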

>> +				rcu_read_lock();
>> +				cur = rcu_dereference(cpu_rq(i)->curr);
>> +				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
>> +				    cur->mm == mm)
>> +					nr_running++;
>> +				rcu_read_unlock();
>>   			}
>>   
>>   			/*
>> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>>   		mm->mm_sched_cpu = m_a_cpu;
>>   	}
>>   
>> +	update_avg(&mm->nr_running_avg, nr_running);
>>   	free_cpumask_var(cpus);
>>   }
> 
> It's a wee bit weird to introduce nr_running_avg without its user. Makes
> it hard to see what's what.

OK, will introduce it together with its user.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-10 17:02   ` Peter Zijlstra
@ 2025-12-16  7:42     ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-16  7:42 UTC (permalink / raw)
  To: Peter Zijlstra, Tim Chen
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/11/2025 1:02 AM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:39PM -0800, Tim Chen wrote:
> 
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 466ba8b7398c..95bf080bbbf0 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>>   DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>>   
>>   #ifdef CONFIG_SCHED_CACHE
>> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
>> +
>>   static inline bool sched_cache_enabled(void)
>>   {
>> -	return false;
>> +	return static_branch_unlikely(&sched_cache_on);
>>   }
>>   #endif
>>   
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 02e16b70a790..cde324672103 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>>   	.release	= single_release,
>>   };
>>   
>> +#ifdef CONFIG_SCHED_CACHE
>> +#define SCHED_CACHE_CREATE_CONTROL(name, max)			  \
>> +static ssize_t sched_cache_write_##name(struct file *filp,	  \
>> +					const char __user *ubuf,  \
>> +					size_t cnt, loff_t *ppos) \
>> +{								  \
>> +	char buf[16];						  \
>> +	unsigned int val;					  \
>> +	if (cnt > 15)						  \
>> +		cnt = 15;					  \
>> +	if (copy_from_user(&buf, ubuf, cnt))			  \
>> +		return -EFAULT;					  \
>> +	buf[cnt] = '\0';					  \
> 
> 
>> +	if (kstrtouint(buf, 10, &val))				  \
>> +		return -EINVAL;					  \
>> +	if (val > (max))						  \
>> +		return -EINVAL;					  \
>> +	llc_##name = val;					  \
>> +	if (!strcmp(#name, "enabled"))				  \
>> +		sched_cache_set(false);				  \
> 
> Oh gawd :-(
> 
> Please just write out all the various write methods and use
> kstrtoul_from_user() and kstrtobool_from_user() where applicable.
> 

OK, will do.
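
A minimal sketch of one written-out handler (the llc_overload_pct knob
name is just a placeholder, not taken from the patch set):

static unsigned int llc_overload_pct = 20;	/* placeholder knob */

static ssize_t sched_cache_write_overload_pct(struct file *filp,
					      const char __user *ubuf,
					      size_t cnt, loff_t *ppos)
{
	unsigned int val;
	int ret;

	ret = kstrtouint_from_user(ubuf, cnt, 10, &val);
	if (ret)
		return ret;

	if (val > 100)
		return -EINVAL;

	llc_overload_pct = val;
	return cnt;
}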

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-16  5:31       ` Chen, Yu C
@ 2025-12-16 19:53         ` Tim Chen
  2025-12-17  5:25           ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Tim Chen @ 2025-12-16 19:53 UTC (permalink / raw)
  To: Chen, Yu C, Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Tue, 2025-12-16 at 13:31 +0800, Chen, Yu C wrote:
> On 12/16/2025 4:49 AM, Tim Chen wrote:
> > On Tue, 2025-12-09 at 12:58 +0100, Peter Zijlstra wrote:
> > > On Wed, Dec 03, 2025 at 03:07:23PM -0800, Tim Chen wrote:
> > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 710ed9943d27..0a3918269906 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -1210,10 +1210,17 @@ __read_mostly unsigned int llc_imb_pct            = 20;
> > > >   
> > > >   static int llc_id(int cpu)
> > > >   {
> > > > +	int llc;
> > > > +
> > > >   	if (cpu < 0)
> > > >   		return -1;
> > > >   
> > > > +	llc = per_cpu(sd_llc_id, cpu);
> > > > +	/* avoid race with cpu hotplug */
> > > > +	if (unlikely(llc >= max_llcs))
> > > > +		return -1;
> > > > +
> > > > +	return llc;
> > > >   }
> > > >   
> > > >   void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
> > > 
> > > > @@ -668,6 +670,55 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > > >   DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
> > > >   DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
> > > >   
> > > > +/*
> > > > + * Assign continuous llc id for the CPU, and return
> > > > + * the assigned llc id.
> > > > + */
> > > > +static int update_llc_id(struct sched_domain *sd,
> > > > +			 int cpu)
> > > > +{
> > > > +	int id = per_cpu(sd_llc_id, cpu), i;
> > > > +
> > > > +	if (id >= 0)
> > > > +		return id;
> > > > +
> > > > +	if (sd) {
> > > > +		/* Look for any assigned id and reuse it.*/
> > > > +		for_each_cpu(i, sched_domain_span(sd)) {
> > > > +			id = per_cpu(sd_llc_id, i);
> > > > +
> > > > +			if (id >= 0) {
> > > > +				per_cpu(sd_llc_id, cpu) = id;
> > > > +				return id;
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * When 1. there is no id assigned to this LLC domain,
> > > > +	 * or 2. the sd is NULL, we reach here.
> > > > +	 * Consider the following scenario,
> > > > +	 * CPU0~CPU95 are in the node0, CPU96~CPU191 are
> > > > +	 * in the node1. During bootup, maxcpus=96 is
> > > > +	 * appended.
> > > > +	 * case 1: When running cpu_attach_domain(CPU24)
> > > > +	 * during boot up, CPU24 is the first CPU in its
> > > > +	 * non-NULL LLC domain. However,
> > > > +	 * its corresponding llc id has not been assigned yet.
> > > > +	 *
> > > > +	 * case 2: After boot up, the CPU100 is brought up
> > > > +	 * via sysfs manually. As a result, CPU100 has only a
> > > > +	 * Numa domain attached, because CPU100 is the only CPU
> > > > +	 * of a sched domain, all its bottom domains are degenerated.
> > > > +	 * The LLC domain pointer sd is NULL for CPU100.
> > > > +	 *
> > > > +	 * For both cases, we want to increase the number of LLCs.
> > > > +	 */
> > > > +	per_cpu(sd_llc_id, cpu) = max_llcs++;
> > > > +
> > > > +	return per_cpu(sd_llc_id, cpu);
> > > > +}
> > > 
> > > I'm not sure I follow. So partition_sched_domains() first calls
> > > detach_destroy_domains() on the old set, and then build_sched_domains()
> > > on the new set.
> > > 
> > > So detach_destroy_domain() will do:
> > > 
> > >    cpu_attach_domain(NULL,..);
> > > 
> > > That is, it will explicitly attach the NULL sched_domain to a CPU. At
> > > which point I feel update_llc_id() should be returning -1, no?
> > > 
> > > Then later, build_sched_domains() will set a !NULL sched_domain, at
> > > which point update_llc_id() can set a real value.
> > > 
> > > This should then also get rid of that weird max_llcs check in llc_id(),
> > > right?
> 
> The check for max_llcs was intended to prevent out-of-bounds access
> to rq->nr_pref_llc[] at multiple points in the code.
> Since dst_llc = llc_id(env->dst_cpu); — and while the LLC ID for the
>   CPU is updated in update_llc_id(), this update occurs before we reallocate
>   the nr_pref_llc buffer — dst_llc may end up exceeding the bounds of the
> original nr_pref_llc buffer.
> 
> For this reason, we added a check if (dst_llc > max_llc) in llc_id()
> when attempting to access rq->nr_pref_llc[dst_llc].
> 
> However, I agree that the max_llc check seems to not properly integrated
> into  the current patch: it should instead be placed in the 7th patch, as
> this would better illustrate the rationale for the max_llc check here:
> sched/cache: Introduce per runqueue task LLC preference counter
> 
> In the 7th patch, we actually increment new_max_llcs rather than
> max_llcs — meaning max_llcs always represents the "old" number of LLCs.
> As a result, there is a race window between extending the rq->nr_pref_llc
> buffer and updating max_llcs.
> 
> 
> @@ -714,7 +827,7 @@ static int update_llc_id(struct sched_domain *sd,
>   	 *
>   	 * For both cases, we want to increase the number of LLCs.
>   	 */
> -	per_cpu(sd_llc_id, cpu) = max_llcs++;
> +	per_cpu(sd_llc_id, cpu) = new_max_llcs++;
> 
>   	return per_cpu(sd_llc_id, cpu);
>   }
> 
> 
> > Thanks for pointing this out.  Yes, we should take care of the
> > attachment of NULL sd. Will update the code accordingly.
> > 
> 
> My understanding is that, if the sd is NULL, it is either because invoked
> by detach_destroy_domain() for the old set, or by case 2 mentioned in 
> above comments:
> Say, CPU0-CPU95 are online during bootup, the boot command line is 
> maxcpus=96.
> Later after bootup, the user wants to bring up CPU100, the LLC domain for
> CPU100 is NULL in this case(due to sd generation), and a new LLC should be
> detected.
> 
> That is to say, when we reach update_llc_id(), there could be 2 reasons
> for NULL sd. For the detach_destroy_domain() case, update_llc_id()
> should return a valid id without increasing the max_llcs, because of
>      if (id >= 0)
>          return id;
> And for the latter, the max_llcs should be increased.
> Let me double check on this.

The issue is that we could offline all the CPUs in an LLC and online them later.
In the current code, we will assign all of their ids to -1. So when those CPUs
are attached again, we'll be assigning a new LLC.  I think the proper thing
to do is to not touch the llc id of the offlined cpu (the case where sd == NULL)
and keep the original llc id assigned.  Then we should be okay and not
increase max_llcs.
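
A minimal sketch of that direction (based on the v2 helper; not the final
code):

static int update_llc_id(struct sched_domain *sd, int cpu)
{
	int id = per_cpu(sd_llc_id, cpu), i;

	/* Detach/offline case: keep whatever id was assigned before. */
	if (!sd)
		return id;

	if (id >= 0)
		return id;

	/* Reuse an id already assigned to another CPU in this LLC. */
	for_each_cpu(i, sched_domain_span(sd)) {
		id = per_cpu(sd_llc_id, i);
		if (id >= 0) {
			per_cpu(sd_llc_id, cpu) = id;
			return id;
		}
	}

	/* First CPU seen in a genuinely new LLC. */
	per_cpu(sd_llc_id, cpu) = max_llcs++;

	return per_cpu(sd_llc_id, cpu);
}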

Tim

> 
> thanks,
> Chenyu
> 
> 
> > Tim

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
       [not found]   ` <fbf52d91-0605-4608-b9cc-e8cc56115fd5@gmail.com>
@ 2025-12-16 22:30     ` Tim Chen
  0 siblings, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-16 22:30 UTC (permalink / raw)
  To: Vern Hao, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Thu, 2025-12-11 at 16:42 +0800, Vern Hao wrote:
snip

> > +		struct mm_sched __percpu *pcpu_sched;
> > +		raw_spinlock_t mm_sched_lock;
> > +		unsigned long mm_sched_epoch;
> > +		int mm_sched_cpu;
> >  
> As we discussed earlier, I continue to believe that dedicating 'mm_sched_cpu' to handle the aggregated hotspots of all threads is inappropriate, as the multiple threads lack a necessary correlation in our real application.
> 
> So, I was wondering if we could put this variable into struct task_struct. That allows us to better monitor the hotspot CPU of each thread, despite some details needing consideration.
>  
Vern,
The stat is related to the group of threads (tasks) that we would like to group
together in an LLC. In the current implementation, it is per process and hence
its placement in the mm struct.

Later, we could change the grouping to other criteria (e.g. cgroup or numa_group). In that
case the stats may be associated with another data structure that represents the
group. As I mentioned in the cover letter, that would be another exercise, and we'd
like to get the basic grouping by process merged first before considering other
kinds of grouping.

Tim


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter
  2025-12-11 10:31       ` Peter Zijlstra
  2025-12-15 19:21         ` Tim Chen
@ 2025-12-16 22:45         ` Tim Chen
  1 sibling, 0 replies; 111+ messages in thread
From: Tim Chen @ 2025-12-16 22:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On Thu, 2025-12-11 at 11:31 +0100, Peter Zijlstra wrote:
> On Wed, Dec 10, 2025 at 10:49:14AM -0800, Tim Chen wrote:
> > On Wed, 2025-12-10 at 13:51 +0100, Peter Zijlstra wrote:
> > > On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:
> 
> > > Would it perhaps be easier to stick this thing in rq->sd rather than in
> > > rq->nr_pref_llc. That way it automagically switches with the 'new'
> > > domain. And then, with a bit of care, a singe load-balance pass should
> > > see a consistent view (there should not be reloads of rq->sd -- which
> > > will be a bit of an audit I suppose).
> > 
> > We need nr_pref_llc information at the runqueue level because the load balancer 
> > must identify which specific rq has the largest number of tasks that 
> > prefer a given destination LLC. If we move the counter to the LLC’s sd 
> > level, we would only know the aggregate number of tasks in the entire LLC 
> > that prefer that destination—not which rq they reside on. Without per-rq 
> > counts, we would not be able to select the correct source rq to pull tasks from.
> > 
> > The only way this could work at the LLC-sd level is if all CPUs within 
> > the LLC shared a single runqueue, which is not the case today.
> > 
> > Let me know if I understand your comments correctly.
> 
> So the sched_domain instances are per-cpu (hence the need for
> sched_domain_shared). So irrespective of what level you stick them at (I
> was thinking the bottom most, but it really doesn't matter) they will be
> per CPU.

One side effect of that is when rebuild_sched_domains() is triggered, all
rq->sd gets reallocated. So we'll lose the old LLC preferences until
we have had time to re-sample process occupancy. I think it is okay as long
as rebuild_sched_domains() is not called too frequently.  Is this assumption
correct?

Tim 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing
  2025-12-16  6:12     ` Chen, Yu C
@ 2025-12-17  1:17       ` Vern Hao
  0 siblings, 0 replies; 111+ messages in thread
From: Vern Hao @ 2025-12-17  1:17 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Peter Zijlstra,
	Ingo Molnar, K Prateek Nayak, Vincent Guittot, Gautham R . Shenoy,
	Tim Chen


On 2025/12/16 14:12, Chen, Yu C wrote:
> On 12/11/2025 5:03 PM, Vern Hao wrote:
>> Hi, Peter, Chen Yu and Tim:
>>
>> On 2025/12/4 07:07, Tim Chen wrote:
>>> From: "Peter Zijlstra (Intel)" <peterz@infradead.org>
>>>
>>> Adds infrastructure to enable cache-aware load balancing,
>>> which improves cache locality by grouping tasks that share resources
>>> within the same cache domain. This reduces cache misses and improves
>>> overall data access efficiency.
>>>
>>> In this initial implementation, threads belonging to the same process
>>> are treated as entities that likely share working sets. The mechanism
>>> tracks per-process CPU occupancy across cache domains and attempts to
>>> migrate threads toward cache-hot domains where their process already
>>> has active threads, thereby enhancing locality.
>>>
>>> This provides a basic model for cache affinity. While the current code
>>> targets the last-level cache (LLC), the approach could be extended to
>>> other domain types such as clusters (L2) or node-internal groupings.
>>>
>>> At present, the mechanism selects the CPU within an LLC that has the
>>> highest recent runtime. Subsequent patches in this series will use this
>>> information in the load-balancing path to guide task placement toward
>>> preferred LLCs.
>>>
>>> In the future, more advanced policies could be integrated through NUMA
>>> balancing-for example, migrating a task to its preferred LLC when spare
>>> capacity exists, or swapping tasks across LLCs to improve cache 
>>> affinity.
>>> Grouping of tasks could also be generalized from that of a process
>>> to be that of a NUMA group, or be user configurable.
>>>
>>> Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> ---
>>>
>>> Notes:
>>>      v1->v2:
>>>         Restore the original CPU scan to cover all online CPUs,
>>>         rather than scanning within the preferred NUMA node.
>>>         (Peter Zijlstra)
>>>         Use rq->curr instead of rq->donor. (K Prateek Nayak)
>>>         Minor fix in task_tick_cache() to use
>>>         if (mm->mm_sched_epoch >= rq->cpu_epoch)
>>>         to avoid mm_sched_epoch going backwards.
>>>
>>>   include/linux/mm_types.h |  44 +++++++
>>>   include/linux/sched.h    |  11 ++
>>>   init/Kconfig             |  11 ++
>>>   kernel/fork.c            |   6 +
>>>   kernel/sched/core.c      |   6 +
>>>   kernel/sched/fair.c      | 258 
>>> +++++++++++++++++++++++++++++++++++++++
>>>   kernel/sched/sched.h     |   8 ++
>>>   7 files changed, 344 insertions(+)
>>>
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 90e5790c318f..1ea16ef90566 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -939,6 +939,11 @@ typedef struct {
>>>       DECLARE_BITMAP(__mm_flags, NUM_MM_FLAG_BITS);
>>>   } __private mm_flags_t;
>>> +struct mm_sched {
>>> +    u64 runtime;
>>> +    unsigned long epoch;
>>> +};
>>> +
>>>   struct kioctx_table;
>>>   struct iommu_mm_data;
>>>   struct mm_struct {
>>> @@ -1029,6 +1034,17 @@ struct mm_struct {
>>>            */
>>>           raw_spinlock_t cpus_allowed_lock;
>>>   #endif
>>> +#ifdef CONFIG_SCHED_CACHE
>>> +        /*
>>> +         * Track per-cpu-per-process occupancy as a proxy for cache 
>>> residency.
>>> +         * See account_mm_sched() and ...
>>> +         */
>>> +        struct mm_sched __percpu *pcpu_sched;
>>> +        raw_spinlock_t mm_sched_lock;
>>> +        unsigned long mm_sched_epoch;
>>> +        int mm_sched_cpu;
>> As we discussed earlier, I continue to believe that dedicating
>> 'mm_sched_cpu' to handle the aggregated hotspots of all threads is
>> inappropriate, as the multiple threads lack a necessary correlation
>> in our real application.
>>
>> So, I was wondering if we could put this variable into struct
>> task_struct. That allows us to better monitor the hotspot CPU of each
>> thread, despite some details needing consideration.
>>
>
> I suppose you are suggesting fine-grained control for a set of tasks.
> Process-scope aggregation could be a start as the default strategy
> (conservative, benefits multi-threaded workloads that share data per
> process, and does not introduce regressions).

Yes, in our real-world business scenarios at Tencent, I have indeed
encountered this issue: multiple threads are divided into several
categories to handle different transactions, so they do not share the
hot data and 'mm_sched_cpu' does not represent all of their tasks, so
adding a control interface such as cgroups or something else would be
a good idea.

>
> On top of that, I wonder if we could provide task-scope control like
> sched_setattr(), similar to the core-scheduling cookie mechanism, for
> users that want aggressive aggregation. But before doing that, we need a
> mechanism that leverages a monitoring facility (like the PMU) to figure out
There may be a problem if the environment is running in a VM. We
could use tags to differentiate these tasks and run some tests to verify
the performance difference between unifying mm_sched_cpu and not
unifying it.
> whether putting these tasks together would bring benefit (if I understand
> Steven's suggestion at LPC correctly), or to detect tasks that share
> resources, and then maybe leverage QoS interfaces to enable cache-aware
> aggregation (something Qias mentioned at LPC).
>
> thanks,
> Chenyu
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-16 19:53         ` Tim Chen
@ 2025-12-17  5:25           ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-17  5:25 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra
  Cc: Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy, Vincent Guittot,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/17/2025 3:53 AM, Tim Chen wrote:
> On Tue, 2025-12-16 at 13:31 +0800, Chen, Yu C wrote:
>> On 12/16/2025 4:49 AM, Tim Chen wrote:
>>> On Tue, 2025-12-09 at 12:58 +0100, Peter Zijlstra wrote:
>>>> On Wed, Dec 03, 2025 at 03:07:23PM -0800, Tim Chen wrote:
>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 710ed9943d27..0a3918269906 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -1210,10 +1210,17 @@ __read_mostly unsigned int llc_imb_pct            = 20;
>>>>>    
>>>>>    static int llc_id(int cpu)
>>>>>    {
>>>>> +	int llc;
>>>>> +
>>>>>    	if (cpu < 0)
>>>>>    		return -1;
>>>>>    
>>>>> +	llc = per_cpu(sd_llc_id, cpu);
>>>>> +	/* avoid race with cpu hotplug */
>>>>> +	if (unlikely(llc >= max_llcs))
>>>>> +		return -1;
>>>>> +
>>>>> +	return llc;
>>>>>    }
>>>>>    
>>>>>    void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
>>>>
>>>>> @@ -668,6 +670,55 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>>>>    DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
>>>>>    DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
>>>>>    
>>>>> +/*
>>>>> + * Assign continuous llc id for the CPU, and return
>>>>> + * the assigned llc id.
>>>>> + */
>>>>> +static int update_llc_id(struct sched_domain *sd,
>>>>> +			 int cpu)
>>>>> +{
>>>>> +	int id = per_cpu(sd_llc_id, cpu), i;
>>>>> +
>>>>> +	if (id >= 0)
>>>>> +		return id;
>>>>> +
>>>>> +	if (sd) {
>>>>> +		/* Look for any assigned id and reuse it.*/
>>>>> +		for_each_cpu(i, sched_domain_span(sd)) {
>>>>> +			id = per_cpu(sd_llc_id, i);
>>>>> +
>>>>> +			if (id >= 0) {
>>>>> +				per_cpu(sd_llc_id, cpu) = id;
>>>>> +				return id;
>>>>> +			}
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	/*
>>>>> +	 * When 1. there is no id assigned to this LLC domain,
>>>>> +	 * or 2. the sd is NULL, we reach here.
>>>>> +	 * Consider the following scenario,
>>>>> +	 * CPU0~CPU95 are in the node0, CPU96~CPU191 are
>>>>> +	 * in the node1. During bootup, maxcpus=96 is
>>>>> +	 * appended.
>>>>> +	 * case 1: When running cpu_attach_domain(CPU24)
>>>>> +	 * during boot up, CPU24 is the first CPU in its
>>>>> +	 * non-NULL LLC domain. However,
>>>>> +	 * its corresponding llc id has not been assigned yet.
>>>>> +	 *
>>>>> +	 * case 2: After boot up, the CPU100 is brought up
>>>>> +	 * via sysfs manually. As a result, CPU100 has only a
>>>>> +	 * Numa domain attached, because CPU100 is the only CPU
>>>>> +	 * of a sched domain, all its bottom domains are degenerated.
>>>>> +	 * The LLC domain pointer sd is NULL for CPU100.
>>>>> +	 *
>>>>> +	 * For both cases, we want to increase the number of LLCs.
>>>>> +	 */
>>>>> +	per_cpu(sd_llc_id, cpu) = max_llcs++;
>>>>> +
>>>>> +	return per_cpu(sd_llc_id, cpu);
>>>>> +}
>>>>
>>>> I'm not sure I follow. So partition_sched_domains() first calls
>>>> detach_destroy_domains() on the old set, and then build_sched_domains()
>>>> on the new set.
>>>>
>>>> Do detach_destroy_domain() will do:
>>>>
>>>>     cpu_attach_domain(NULL,..);
>>>>
>>>> That is, it will explicitly attach the NULL sched_domain to a CPU. At
>>>> which point I feel update_llc_id() should be returning -1, no?
>>>>
>>>> Then later, build_sched_domains() will set a !NULL sched_domain, at
>>>> which point update_llc_id() can set a real value.
>>>>
>>>> This should then also get rid of that weird max_llcs check in llc_id(),
>>>> right?
>>
>> The check for max_llcs was intended to prevent out-of-bounds access
>> to rq->nr_pref_llc[] at multiple points in the code.
>> Since dst_llc = llc_id(env->dst_cpu); — and while the LLC ID for the
>>    CPU is updated in update_llc_id(), this update occurs before we reallocate
>>    the nr_pref_llc buffer — dst_llc may end up exceeding the bounds of the
>> original nr_pref_llc buffer.
>>
>> For this reason, we added a check if (dst_llc > max_llc) in llc_id()
>> when attempting to access rq->nr_pref_llc[dst_llc].
>>
>> However, I agree that the max_llc check seems to not properly integrated
>> into  the current patch: it should instead be placed in the 7th patch, as
>> this would better illustrate the rationale for the max_llc check here:
>> sched/cache: Introduce per runqueue task LLC preference counter
>>
>> In the 7th patch, we actually increment new_max_llcs rather than
>> max_llcs — meaning max_llcs always represents the "old" number of LLCs.
>> As a result, there is a race window between extending the rq->nr_pref_llc
>> buffer and updating max_llcs.
>>
>>
>> @@ -714,7 +827,7 @@ static int update_llc_id(struct sched_domain *sd,
>>    	 *
>>    	 * For both cases, we want to increase the number of LLCs.
>>    	 */
>> -	per_cpu(sd_llc_id, cpu) = max_llcs++;
>> +	per_cpu(sd_llc_id, cpu) = new_max_llcs++;
>>
>>    	return per_cpu(sd_llc_id, cpu);
>>    }
>>
>>
>>> Thanks for pointing this out.  Yes, we should take care of the
>>> attachment of NULL sd. Will update the code accordingly.
>>>
>>
>> My understanding is that, if the sd is NULL, it is either because invoked
>> by detach_destroy_domain() for the old set, or by case 2 mentioned in
>> above comments:
>> Say, CPU0-CPU95 are online during bootup, the boot command line is
>> maxcpus=96.
>> Later after bootup, the user wants to bring up CPU100, the LLC domain for
>> CPU100 is NULL in this case(due to sd generation), and a new LLC should be
>> detected.
>>
>> That is to say, when we reach update_llc_id(), there could be 2 reasons
>> for NULL sd. For the detach_destroy_domain() case, update_llc_id()
>> should return a valid id without increasing the max_llcs, because of
>>       if (id >= 0)
>>           return id;
>> And for the latter, the max_llcs should be increased.
>> Let me double check on this.
> 
> The issue is we could offline all CPUs in a LLC and online them later.
> In the current code, we will assign their ids all to -1.

I suppose we don't reset the ids in the current implementation; only
the first scan of LLCs resets/initializes the ids to -1 in
build_sched_domains()?
         if (!max_llcs) { //max_llcs is initialized to 0 during bootup
                 for_each_possible_cpu(i)
                         per_cpu(sd_llc_id, i) = -1;
         }

> So on attach
> of CPUs again, we'll be assigning a new LLC.  I think the proper thing
> to do is not to assign llc id of the offlined cpu (the case where sd == NULL)
> and keep the original llc id assigned.  Then we should be okay and not
> increase max_llcs.
> 

This is the current implementation, because we don't assign new ids to
CPUs that already have an id (no matter whether they are offline or online).

thanks,
Chenyu



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
  2025-12-03 23:07 ` [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling Tim Chen
  2025-12-10 16:51   ` Peter Zijlstra
@ 2025-12-17  9:40   ` Aaron Lu
  2025-12-17 12:51     ` Chen, Yu C
  1 sibling, 1 reply; 111+ messages in thread
From: Aaron Lu @ 2025-12-17  9:40 UTC (permalink / raw)
  To: Chen Yu, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  		mm->mm_sched_cpu = m_a_cpu;
>  	}
>  
> +	update_avg(&mm->nr_running_avg, nr_running);

update_avg() doesn't appear to deal with small numbers well and can have
an error as large as 7: e.g. when nr_running < 8, nr_running_avg will
always be 0, and when nr_running is >= 8 and < 16, nr_running_avg will
end up somewhere between 1 and 8, etc.

AMD Genoa has 8 cores per LLC and this will break exceed_llc_nr() there.
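
For reference, this is how the rounding plays out (a sketch, assuming
update_avg() keeps its usual kernel/sched form of adding diff / 8):

	static inline void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;

		*avg += diff / 8;	/* signed division truncates toward zero */
	}

Starting from *avg == 0 with a steady sample of 7, diff / 8 is always 0
and the average never moves; with a steady sample of 15, it only creeps
up to 15 - 7 = 8.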

>  	free_cpumask_var(cpus);
>  }

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 23/23] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs
  2025-12-03 23:07 ` [PATCH v2 23/23] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Tim Chen
@ 2025-12-17  9:59   ` Aaron Lu
  2025-12-17 13:01     ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Aaron Lu @ 2025-12-17  9:59 UTC (permalink / raw)
  To: Chen Yu, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:42PM -0800, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
> 
> Debug patch only.
> 
> Show the per-LLC occupancy in /proc/{PID}/schedstat, with each column
> corresponding to one LLC. This can be used to verify if the cache-aware
> load balancer works as expected by aggregating threads onto dedicated LLCs.
> 
> Suppose there are 2 LLCs and the sampling duration is 10 seconds:
> 
> Enable the cache aware load balance:
> 0 12281  <--- LLC0 residency delta is 0, LLC1 is 12 seconds
> 0 18881
> 0 16217
> 
> disable the cache aware load balance:
> 6497 15802
> 9299 5435
> 17811 8278
> 
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  fs/proc/base.c           | 22 ++++++++++++++++++++++
>  include/linux/mm_types.h | 19 +++++++++++++++++--
>  include/linux/sched.h    |  3 +++
>  kernel/sched/fair.c      | 40 ++++++++++++++++++++++++++++++++++++++--
>  4 files changed, 80 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 6299878e3d97..f4be96f4bd01 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -518,6 +518,28 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
>  		   (unsigned long long)task->se.sum_exec_runtime,
>  		   (unsigned long long)task->sched_info.run_delay,
>  		   task->sched_info.pcount);
> +#ifdef CONFIG_SCHED_CACHE
> +	if (sched_cache_enabled()) {
> +		struct mm_struct *mm = task->mm;
> +		u64 *llc_runtime;
> +
> +		if (!mm)
> +			return 0;
> +
> +		llc_runtime = kcalloc(max_llcs, sizeof(u64), GFP_KERNEL);
> +		if (!llc_runtime)
> +			return 0;
> +
> +		if (get_mm_per_llc_runtime(task, llc_runtime))
> +			goto out;
> +
> +		for (int i = 0; i < max_llcs; i++)
> +			seq_printf(m, "%llu ", llc_runtime[i]);

I feel it is better to also mark the current preferred LLC of this
process so that I can know how well it works.

> +		seq_puts(m, "\n");
> +out:
> +		kfree(llc_runtime);
> +	}
> +#endif
>  
>  	return 0;
>  }

BTW, is there a way to tell whether a process is being taken care of by
'cache aware scheduling', or whether it is excluded due to its huge RSS
or having too many threads?

I used the debug code below to get this info through schedstat, but maybe
I missed something and there is a simpler method?

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f4be96f4bd015..c709a1a1bd867 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -505,6 +505,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
 #endif
 
 #ifdef CONFIG_SCHED_INFO
+DECLARE_PER_CPU(int, sd_llc_id);
 /*
  * Provides /proc/PID/schedstat
  */
@@ -522,6 +523,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
 	if (sched_cache_enabled()) {
 		struct mm_struct *mm = task->mm;
 		u64 *llc_runtime;
+		int mm_sched_llc;
 
 		if (!mm)
 			return 0;
@@ -533,8 +535,17 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
 		if (get_mm_per_llc_runtime(task, llc_runtime))
 			goto out;
 
+		if (mm->mm_sched_cpu == -1)
+			mm_sched_llc = -1;
+		else
+			mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);
+
+		seq_printf(m, "%llu 0x%x\n", mm->nr_running_avg, mm->mm_sched_flags);
 		for (int i = 0; i < max_llcs; i++)
-			seq_printf(m, "%llu ", llc_runtime[i]);
+			seq_printf(m, "%s%s%llu ",
+				   i == task->preferred_llc ? "*" : "",
+				   i == mm_sched_llc ? "?" : "",
+				   llc_runtime[i]);
 		seq_puts(m, "\n");
 out:
 		kfree(llc_runtime);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 255c22be7312f..06bb106d1b724 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1048,6 +1048,7 @@ struct mm_struct {
 		raw_spinlock_t mm_sched_lock;
 		unsigned long mm_sched_epoch;
 		int mm_sched_cpu;
+		int mm_sched_flags;
 		u64 nr_running_avg ____cacheline_aligned_in_smp;
 #endif
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 205208f061bb3..ab1cdba65d389 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1237,12 +1237,20 @@ static inline int get_sched_cache_scale(int mul)
 	return (1 + (llc_aggr_tolerance - 1) * mul);
 }
 
+#define MM_SCHED_EXCEED_LLC_CAPACITY	1
+#define MM_SCHED_NO_CACHE_INFO		2
+#define MM_SCHED_EXCEED_LLC_NR		4
+#define MM_SCHED_NR_THREADS		8
+
 static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 {
 	unsigned int llc, scale;
 	struct cacheinfo *ci;
 	unsigned long rss;
 
+	mm->mm_sched_flags &= ~MM_SCHED_NO_CACHE_INFO;
+	mm->mm_sched_flags &= ~MM_SCHED_EXCEED_LLC_CAPACITY;
+
 	/*
 	 * get_cpu_cacheinfo_level() can not be used
 	 * because it requires the cpu_hotplug_lock
@@ -1257,8 +1265,10 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 		 * L2 becomes the LLC.
 		 */
 		ci = _get_cpu_cacheinfo_level(cpu, 2);
-		if (!ci)
+		if (!ci) {
+			mm->mm_sched_flags |= MM_SCHED_NO_CACHE_INFO;
 			return true;
+		}
 	}
 
 	llc = ci->size;
@@ -1283,13 +1293,20 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 	if (scale == INT_MAX)
 		return false;
 
-	return ((llc * scale) <= (rss * PAGE_SIZE));
+	if ((llc * scale) <= (rss * PAGE_SIZE)) {
+		mm->mm_sched_flags |= MM_SCHED_EXCEED_LLC_CAPACITY;
+		return true;
+	}
+
+	return false;
 }
 
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
 	int smt_nr = 1, scale;
 
+	mm->mm_sched_flags &= ~MM_SCHED_EXCEED_LLC_NR;
+
 #ifdef CONFIG_SCHED_SMT
 	if (sched_smt_active())
 		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
@@ -1313,7 +1330,12 @@ static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 	if (scale == INT_MAX)
 		return false;
 
-	return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
+	if ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu))) {
+		mm->mm_sched_flags |= MM_SCHED_EXCEED_LLC_NR;
+		return true;
+	}
+
+	return false;
 }
 
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)

^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue
  2025-12-03 23:07 ` [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
  2025-12-09 12:16   ` Peter Zijlstra
@ 2025-12-17 10:04   ` Vern Hao
  2025-12-17 12:37     ` Chen, Yu C
  1 sibling, 1 reply; 111+ messages in thread
From: Vern Hao @ 2025-12-17 10:04 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel


On 2025/12/4 07:07, Tim Chen wrote:
> For each runqueue, track the number of tasks with an LLC preference
> and how many of them are running on their preferred LLC. This mirrors
> nr_numa_running and nr_preferred_running for NUMA balancing, and will
> be used by cache-aware load balancing in later patches.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>
> Notes:
>      v1->v2: Invoke task_of() once and reuse its result afterwards.
>              (Peter Zijlstra)
>              Remove hacky reset_llc_stats() and introduce sched_llc_active flag
>              to properly pair enqueue/dequeue statistics update (Peter Zijlstra, K Prateek Nayak)
>
>   include/linux/sched.h |  2 ++
>   init/init_task.c      |  1 +
>   kernel/sched/core.c   |  5 ++++
>   kernel/sched/fair.c   | 60 ++++++++++++++++++++++++++++++++++++++++---
>   kernel/sched/sched.h  |  6 +++++
>   5 files changed, 71 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1ad46220cd04..466ba8b7398c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1408,6 +1408,8 @@ struct task_struct {
>   
>   #ifdef CONFIG_SCHED_CACHE
>   	struct callback_head		cache_work;
> +	/*the p is currently refcounted in a rq's preferred llc stats*/
> +	bool				sched_llc_active;
>   	int				preferred_llc;
>   #endif
>   
> diff --git a/init/init_task.c b/init/init_task.c
> index 44bae72b5b7d..ee78837b0aa2 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -192,6 +192,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>   	.numa_faults	= NULL,
>   #endif
>   #ifdef CONFIG_SCHED_CACHE
> +	.sched_llc_active = false,
>   	.preferred_llc  = -1,
>   #endif
>   #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e8bdf03a4b7f..48626c81ba8e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -531,6 +531,11 @@ void __trace_set_current_state(int state_value)
>   }
>   EXPORT_SYMBOL(__trace_set_current_state);
>   
> +int task_llc(const struct task_struct *p)
> +{
> +	return per_cpu(sd_llc_id, task_cpu(p));
> +}
> +
>   /*
>    * Serialization rules:
>    *
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 10cec83f65d5..d46a70a9d9fb 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1223,6 +1223,43 @@ static int llc_id(int cpu)
>   	return llc;
>   }
>   
> +static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
> +{
> +	int pref_llc;
> +
> +	if (!sched_cache_enabled())
> +		return;
> +
> +	pref_llc = p->preferred_llc;
> +	if (pref_llc < 0)
> +		return;
> +
> +	rq->nr_llc_running++;
> +	rq->nr_pref_llc_running += (pref_llc == task_llc(p));
> +	p->sched_llc_active = true;
> +}
> +
> +static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
> +{
> +	int pref_llc;
> +
> +	/*
> +	 * Borrow the uc_se->active from uclamp_rq_inc_id(),
> +	 * uclamp_rq_dec_id() to avoid the unbalanced calculation
> +	 * of rq statistics.
> +	 */
> +	if (unlikely(!p->sched_llc_active))
> +		return;
> +
> +	pref_llc = p->preferred_llc;
> +	if (pref_llc < 0)
> +		return;
> +
> +	rq->nr_llc_running--;
> +	rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
> +	p->sched_llc_active = false;
> +}
> +
>   void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
>   {
>   	unsigned long epoch;
> @@ -1294,6 +1331,8 @@ static unsigned long __no_profile fraction_mm_sched(struct rq *rq, struct mm_sch
>   	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
>   }
>   
> +static unsigned int task_running_on_cpu(int cpu, struct task_struct *p);
> +
>   static inline
>   void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   {
> @@ -1346,8 +1385,13 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   #endif
>   	}
>   
> -	if (p->preferred_llc != mm_sched_llc)
> +	/* task not on rq accounted later in account_entity_enqueue() */
> +	if (task_running_on_cpu(rq->cpu, p) &&
> +	    p->preferred_llc != mm_sched_llc) {
#ifdef CONFIG_NUMA_BALANCING
                 /*
                  * Don't assign preferred LLC if it
                  * conflicts with NUMA balancing.
                  */
                 if (p->numa_preferred_nid >= 0 &&
                     cpu_to_node(mm->mm_sched_cpu) != p->numa_preferred_nid)
                         mm_sched_llc = -1;
#endif
         }

         /* task not on rq accounted later in account_entity_enqueue() */
         if (task_running_on_cpu(rq->cpu, p) &&
             p->preferred_llc != mm_sched_llc) {
                 account_llc_dequeue(rq, p);
                 p->preferred_llc = mm_sched_llc;
                 account_llc_enqueue(rq, p);

         }

I am a little concerned that there might be cases where both
p->preferred_llc and mm_sched_llc are equal to -1 at this point.
Is it necessary to add a check here?



> +		account_llc_dequeue(rq, p);
>   		p->preferred_llc = mm_sched_llc;
> +		account_llc_enqueue(rq, p);
> +	}
>   }
>   
>   static void task_tick_cache(struct rq *rq, struct task_struct *p)
> @@ -1475,6 +1519,10 @@ void init_sched_mm(struct task_struct *p) { }
>   
>   static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
>   
> +static void account_llc_enqueue(struct rq *rq, struct task_struct *p) {}
> +
> +static void account_llc_dequeue(struct rq *rq, struct task_struct *p) {}
> +
>   #endif
>   
>   /*
> @@ -3965,9 +4013,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   {
>   	update_load_add(&cfs_rq->load, se->load.weight);
>   	if (entity_is_task(se)) {
> +		struct task_struct *p = task_of(se);
>   		struct rq *rq = rq_of(cfs_rq);
>   
> -		account_numa_enqueue(rq, task_of(se));
> +		account_numa_enqueue(rq, p);
> +		account_llc_enqueue(rq, p);
>   		list_add(&se->group_node, &rq->cfs_tasks);
>   	}
>   	cfs_rq->nr_queued++;
> @@ -3978,7 +4028,11 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   {
>   	update_load_sub(&cfs_rq->load, se->load.weight);
>   	if (entity_is_task(se)) {
> -		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
> +		struct task_struct *p = task_of(se);
> +		struct rq *rq = rq_of(cfs_rq);
> +
> +		account_numa_dequeue(rq, p);
> +		account_llc_dequeue(rq, p);
>   		list_del_init(&se->group_node);
>   	}
>   	cfs_rq->nr_queued--;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 728737641847..ee8b70647835 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1126,6 +1126,10 @@ struct rq {
>   	unsigned int		nr_preferred_running;
>   	unsigned int		numa_migrate_on;
>   #endif
> +#ifdef CONFIG_SCHED_CACHE
> +	unsigned int		nr_pref_llc_running;
> +	unsigned int		nr_llc_running;
> +#endif
>   #ifdef CONFIG_NO_HZ_COMMON
>   	unsigned long		last_blocked_load_update_tick;
>   	unsigned int		has_blocked_load;
> @@ -1980,6 +1984,8 @@ init_numa_balancing(u64 clone_flags, struct task_struct *p)
>   
>   #endif /* !CONFIG_NUMA_BALANCING */
>   
> +int task_llc(const struct task_struct *p);
> +
>   static inline void
>   queue_balance_callback(struct rq *rq,
>   		       struct balance_callback *head,

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue
  2025-12-17 10:04   ` Vern Hao
@ 2025-12-17 12:37     ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-17 12:37 UTC (permalink / raw)
  To: Vern Hao
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Ingo Molnar

On 12/17/2025 6:04 PM, Vern Hao wrote:
> 
> On 2025/12/4 07:07, Tim Chen wrote:
>> @@ -1346,8 +1385,13 @@ void account_mm_sched(struct rq *rq, struct 
>> task_struct *p, s64 delta_exec)
>>   #endif
>>       }
>> -    if (p->preferred_llc != mm_sched_llc)
>> +    /* task not on rq accounted later in account_entity_enqueue() */
>> +    if (task_running_on_cpu(rq->cpu, p) &&
>> +        p->preferred_llc != mm_sched_llc) {
>> #ifdef CONFIG_NUMA_BALANCING
>>                  /*
>>                   * Don't assign preferred LLC if it
>>                   * conflicts with NUMA balancing.
>>                   */
>>                  if (p->numa_preferred_nid >= 0 &&
>>                      cpu_to_node(mm->mm_sched_cpu) != p- 
>>  >numa_preferred_nid)
>>                          mm_sched_llc = -1;
>> #endif
>>          }
>> 
>>          /* task not on rq accounted later in account_entity_enqueue() */
>>          if (task_running_on_cpu(rq->cpu, p) &&
>>              p->preferred_llc != mm_sched_llc) {
>>                  account_llc_dequeue(rq, p);
>>                  p->preferred_llc = mm_sched_llc;
>>                  account_llc_enqueue(rq, p);
>> 
>>          }
>> 
> I am a little concerned that there might be cases where both
> p->preferred_llc and mm_sched_llc are equal to -1 at this point.
> Is it necessary to add a check here?
>

Are you concerned about a mismatch between the per-CPU runqueue values
of nr_pref_llc_running, nr_pref_llc, and nr_llc_running? This should not
be an issue, because account_llc_dequeue() and account_llc_enqueue() are
always invoked together in account_mm_sched(). If p->preferred_llc and
mm_sched_llc are both -1, account_llc_dequeue/enqueue() will not be
invoked, so the accounting is still paired.
Please let me know if I understand your comments correctly.

thanks,
Chenyu




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
  2025-12-17  9:40   ` Aaron Lu
@ 2025-12-17 12:51     ` Chen, Yu C
  2025-12-19  3:32       ` Aaron Lu
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-17 12:51 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Tim Chen, linux-kernel, Tim Chen

On 12/17/2025 5:40 PM, Aaron Lu wrote:
> On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
>> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>>   		mm->mm_sched_cpu = m_a_cpu;
>>   	}
>>   
>> +	update_avg(&mm->nr_running_avg, nr_running);
> 
> update_avg() doesn't appear to deal with small numbers well and can have
> an error as large as 7, e.g. when nr_running < 8, nr_running_avg will
> always be 0 and when nr_running >= 8 && < 16, nr_running_avg will be
> 1 - 8, etc.
> 
> AMD Genoa has 8 cores per LLC and this will break exceed_llc_nr() there.
> 

Ah, you are right, thanks for pointing this out. Dividing by 8 would make
convergence slow for systems with a small LLC. Maybe we should take the
number of cores in the LLC into account: the smaller that number is, the
more we should honor the diff between two invocations of update_avg()?

static inline void sched_cache_update_avg(u64 *avg, u64 sample)
{
	s64 diff = sample - *avg;
	/* nr_cores_llc: number of cores sharing this LLC */
	u32 divisor = clamp_t(u32, nr_cores_llc / 4, 2, 8);

	*avg += diff / divisor;
}

For <= 8 cores per LLC, the divisor is 2;
for 16 cores per LLC, the divisor is 4;
for >= 32 cores per LLC, the divisor is 8.
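
As a quick sanity check of the divisor-2 behavior for an 8-core LLC,
here is a tiny userspace simulation (illustrative only; update_avg_div()
and the sample values are made up for the demo):

	#include <stdio.h>

	/* Same arithmetic as the proposal, with the divisor passed in. */
	static void update_avg_div(unsigned long long *avg,
				   unsigned long long sample,
				   unsigned int divisor)
	{
		long long diff = (long long)sample - (long long)*avg;

		*avg += diff / (long long)divisor;
	}

	int main(void)
	{
		unsigned long long avg = 0;
		int i;

		for (i = 0; i < 5; i++) {
			/* a steady sample of 7 runnable threads */
			update_avg_div(&avg, 7, 2);
			printf("iteration %d: avg = %llu\n", i + 1, avg);
		}
		/* prints 3, 5, 6, 6, 6: the average tracks the sample,
		 * whereas a fixed /8 divisor would stay at 0 forever. */
		return 0;
	}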

Thanks,
Chenyu

>>   	free_cpumask_var(cpus);
>>   }

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 23/23] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs
  2025-12-17  9:59   ` Aaron Lu
@ 2025-12-17 13:01     ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-17 13:01 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Tim Chen, linux-kernel, Tim Chen

On 12/17/2025 5:59 PM, Aaron Lu wrote:
> On Wed, Dec 03, 2025 at 03:07:42PM -0800, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> Debug patch only.
>>
>> Show the per-LLC occupancy in /proc/{PID}/schedstat, with each column
>> corresponding to one LLC. This can be used to verify if the cache-aware
>> load balancer works as expected by aggregating threads onto dedicated LLCs.
>>
>> Suppose there are 2 LLCs and the sampling duration is 10 seconds:
>>
>> Enable the cache aware load balance:
>> 0 12281  <--- LLC0 residency delta is 0, LLC1 is 12 seconds
>> 0 18881
>> 0 16217
>>
>> disable the cache aware load balance:
>> 6497 15802
>> 9299 5435
>> 17811 8278
>>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>   fs/proc/base.c           | 22 ++++++++++++++++++++++
>>   include/linux/mm_types.h | 19 +++++++++++++++++--
>>   include/linux/sched.h    |  3 +++
>>   kernel/sched/fair.c      | 40 ++++++++++++++++++++++++++++++++++++++--
>>   4 files changed, 80 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 6299878e3d97..f4be96f4bd01 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -518,6 +518,28 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
>>   		   (unsigned long long)task->se.sum_exec_runtime,
>>   		   (unsigned long long)task->sched_info.run_delay,
>>   		   task->sched_info.pcount);
>> +#ifdef CONFIG_SCHED_CACHE
>> +	if (sched_cache_enabled()) {
>> +		struct mm_struct *mm = task->mm;
>> +		u64 *llc_runtime;
>> +
>> +		if (!mm)
>> +			return 0;
>> +
>> +		llc_runtime = kcalloc(max_llcs, sizeof(u64), GFP_KERNEL);
>> +		if (!llc_runtime)
>> +			return 0;
>> +
>> +		if (get_mm_per_llc_runtime(task, llc_runtime))
>> +			goto out;
>> +
>> +		for (int i = 0; i < max_llcs; i++)
>> +			seq_printf(m, "%llu ", llc_runtime[i]);
> 
> I feel it is better to also mark the current preferred LLC of this
> process so that I can know how well it works.
> 

Sure.

>> +		seq_puts(m, "\n");
>> +out:
>> +		kfree(llc_runtime);
>> +	}
>> +#endif
>>   
>>   	return 0;
>>   }
> 
> BTW, is there a way to tell if a process is being taken care of by
> 'cache aware scheduling' or it's blocked due to its huge rss or having
> too many threads?
> 
> I used below debug code to get these info through schedstat, but maybe I
> missed something and there is a simpler method?
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index f4be96f4bd015..c709a1a1bd867 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -505,6 +505,7 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
>   #endif
>   
>   #ifdef CONFIG_SCHED_INFO
> +DECLARE_PER_CPU(int, sd_llc_id);
>   /*
>    * Provides /proc/PID/schedstat
>    */
> @@ -522,6 +523,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
>   	if (sched_cache_enabled()) {
>   		struct mm_struct *mm = task->mm;
>   		u64 *llc_runtime;
> +		int mm_sched_llc;
>   
>   		if (!mm)
>   			return 0;
> @@ -533,8 +535,17 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
>   		if (get_mm_per_llc_runtime(task, llc_runtime))
>   			goto out;
>   
> +		if (mm->mm_sched_cpu == -1)
> +			mm_sched_llc = -1;
> +		else
> +			mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);

We can use llc_id(mm->mm_sched_cpu).

> +
> +		seq_printf(m, "%llu 0x%x\n", mm->nr_running_avg, mm->mm_sched_flags);
>   		for (int i = 0; i < max_llcs; i++)
> -			seq_printf(m, "%llu ", llc_runtime[i]);
> +			seq_printf(m, "%s%s%llu ",
> +				   i == task->preferred_llc ? "*" : "",
> +				   i == mm_sched_llc ? "?" : "",
> +				   llc_runtime[i]);
>   		seq_puts(m, "\n");
>   out:
>   		kfree(llc_runtime);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 255c22be7312f..06bb106d1b724 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1048,6 +1048,7 @@ struct mm_struct {
>   		raw_spinlock_t mm_sched_lock;
>   		unsigned long mm_sched_epoch;
>   		int mm_sched_cpu;
> +		int mm_sched_flags;
>   		u64 nr_running_avg ____cacheline_aligned_in_smp;
>   #endif
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 205208f061bb3..ab1cdba65d389 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1237,12 +1237,20 @@ static inline int get_sched_cache_scale(int mul)
>   	return (1 + (llc_aggr_tolerance - 1) * mul);
>   }
>   
> +#define MM_SCHED_EXCEED_LLC_CAPACITY	1
> +#define MM_SCHED_NO_CACHE_INFO		2
> +#define MM_SCHED_EXCEED_LLC_NR		4
> +#define MM_SCHED_NR_THREADS		8
> +
>   static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>   {
>   	unsigned int llc, scale;
>   	struct cacheinfo *ci;
>   	unsigned long rss;
>   
> +	mm->mm_sched_flags &= ~MM_SCHED_NO_CACHE_INFO;
> +	mm->mm_sched_flags &= ~MM_SCHED_EXCEED_LLC_CAPACITY;
> +

Maybe we can do a read-and-compare before writing the flags. We previously
found that writing to the per-process mm struct is very expensive, so we
should avoid writing to it as much as possible.
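
Something along these lines could work (hypothetical helpers, names made
up for illustration; mm_sched_flags is the debug field from your diff):

	/* Only dirty mm->mm_sched_flags when the bit actually changes. */
	static inline void mm_sched_set_flag(struct mm_struct *mm, int flag)
	{
		if (!(mm->mm_sched_flags & flag))
			mm->mm_sched_flags |= flag;
	}

	static inline void mm_sched_clear_flag(struct mm_struct *mm, int flag)
	{
		if (mm->mm_sched_flags & flag)
			mm->mm_sched_flags &= ~flag;
	}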

I'll fold your changes and do the test. Thanks!

Thanks,
Chenyu


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  2025-12-03 23:07 ` [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
@ 2025-12-18  3:59   ` Vern Hao
  2025-12-18  8:32     ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Vern Hao @ 2025-12-18  3:59 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel


On 2025/12/4 07:07, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
>
> Prateek and Tingyin reported that memory-intensive workloads (such as
> stream) can saturate memory bandwidth and caches on the preferred LLC
> when sched_cache aggregates too many threads.
>
> To mitigate this, estimate a process's memory footprint by comparing
> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
> exceeds the LLC size, skip cache-aware scheduling.
Restricting RSS prevents many applications from benefiting from this 
optimization. I believe this restriction should be lifted. For 
memory-intensive workloads, the optimization may simply yield no gains, 
but it certainly shouldn't make performance worse. We need to further 
refine this logic.
> Note that RSS is only an approximation of the memory footprint.
> By default, the comparison is strict, but a later patch will allow
> users to provide a hint to adjust this threshold.
>
> According to the test from Adam, some systems do not have shared L3
> but with shared L2 as clusters. In this case, the L2 becomes the LLC[1].
>
> Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com/
>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>
> Notes:
>      v1->v2: Assigned curr_cpu in task_cache_work() before checking
>              exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound
>              access.(lkp/0day)
>
>   include/linux/cacheinfo.h | 21 ++++++++++-------
>   kernel/sched/fair.c       | 49 +++++++++++++++++++++++++++++++++++----
>   2 files changed, 57 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
> index c8f4f0a0b874..82d0d59ca0e1 100644
> --- a/include/linux/cacheinfo.h
> +++ b/include/linux/cacheinfo.h
> @@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
>   
>   const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
>   
> -/*
> - * Get the cacheinfo structure for the cache associated with @cpu at
> - * level @level.
> - * cpuhp lock must be held.
> - */
> -static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
> +static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
>   {
>   	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
>   	int i;
>   
> -	lockdep_assert_cpus_held();
> -
>   	for (i = 0; i < ci->num_leaves; i++) {
>   		if (ci->info_list[i].level == level) {
>   			if (ci->info_list[i].attributes & CACHE_ID)
> @@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
>   	return NULL;
>   }
>   
> +/*
> + * Get the cacheinfo structure for the cache associated with @cpu at
> + * level @level.
> + * cpuhp lock must be held.
> + */
> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
> +{
> +	lockdep_assert_cpus_held();
> +
> +	return _get_cpu_cacheinfo_level(cpu, level);
> +}
> +
>   /*
>    * Get the id of the cache associated with @cpu at level @level.
>    * cpuhp lock must be held.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6afa3f9a4e9b..424ec601cfdf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
>   	return llc;
>   }
>   
> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
> +{
> +	struct cacheinfo *ci;
> +	unsigned long rss;
> +	unsigned int llc;
> +
> +	/*
> +	 * get_cpu_cacheinfo_level() can not be used
> +	 * because it requires the cpu_hotplug_lock
> +	 * to be held. Use _get_cpu_cacheinfo_level()
> +	 * directly because the 'cpu' can not be
> +	 * offlined at the moment.
> +	 */
> +	ci = _get_cpu_cacheinfo_level(cpu, 3);
> +	if (!ci) {
> +		/*
> +		 * On system without L3 but with shared L2,
> +		 * L2 becomes the LLC.
> +		 */
> +		ci = _get_cpu_cacheinfo_level(cpu, 2);
> +		if (!ci)
> +			return true;
> +	}
Must we look this up one by one to get the LLC size? Could a static
variable be set instead when building the sched domains?
> +
> +	llc = ci->size;
> +
> +	rss = get_mm_counter(mm, MM_ANONPAGES) +
> +		get_mm_counter(mm, MM_SHMEMPAGES);
> +
> +	return (llc <= (rss * PAGE_SIZE));
> +}
> +
>   static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>   {
>   	int smt_nr = 1;
> @@ -1382,7 +1414,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   	 */
>   	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
>   	    get_nr_threads(p) <= 1 ||
> -	    exceed_llc_nr(mm, cpu_of(rq))) {
> +	    exceed_llc_nr(mm, cpu_of(rq)) ||
> +	    exceed_llc_capacity(mm, cpu_of(rq))) {
>   		if (mm->mm_sched_cpu != -1)
>   			mm->mm_sched_cpu = -1;
>   	}
> @@ -1439,7 +1472,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>   	struct mm_struct *mm = p->mm;
>   	unsigned long m_a_occ = 0;
>   	unsigned long curr_m_a_occ = 0;
> -	int cpu, m_a_cpu = -1, nr_running = 0;
> +	int cpu, m_a_cpu = -1, nr_running = 0, curr_cpu;
>   	cpumask_var_t cpus;
>   
>   	WARN_ON_ONCE(work != &p->cache_work);
> @@ -1449,7 +1482,9 @@ static void __no_profile task_cache_work(struct callback_head *work)
>   	if (p->flags & PF_EXITING)
>   		return;
>   
> -	if (get_nr_threads(p) <= 1) {
> +	curr_cpu = task_cpu(p);
> +	if (get_nr_threads(p) <= 1 ||
> +	    exceed_llc_capacity(mm, curr_cpu)) {
>   		if (mm->mm_sched_cpu != -1)
>   			mm->mm_sched_cpu = -1;
>   
> @@ -9895,8 +9930,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
>   	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
>   		return mig_unrestricted;
>   
> -	/* skip cache aware load balance for single/too many threads */
> -	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
> +	/*
> +	 * Skip cache aware load balance for single/too many threads
> +	 * or large footprint.
> +	 */
> +	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
> +	    exceed_llc_capacity(mm, dst_cpu))
>   		return mig_unrestricted;
>   
>   	if (cpus_share_cache(dst_cpu, cpu))

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  2025-12-18  3:59   ` Vern Hao
@ 2025-12-18  8:32     ` Chen, Yu C
  2025-12-18  9:42       ` Vern Hao
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-18  8:32 UTC (permalink / raw)
  To: Vern Hao
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, K Prateek Nayak, Vincent Guittot,
	Gautham R . Shenoy, Ingo Molnar

On 12/18/2025 11:59 AM, Vern Hao wrote:
> 
> On 2025/12/4 07:07, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> Prateek and Tingyin reported that memory-intensive workloads (such as
>> stream) can saturate memory bandwidth and caches on the preferred LLC
>> when sched_cache aggregates too many threads.
>>
>> To mitigate this, estimate a process's memory footprint by comparing
>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>> exceeds the LLC size, skip cache-aware scheduling.
> Restricting RSS prevents many applications from benefiting from this 
> optimization. I believe this restriction should be lifted. 
> For memory- 
> intensive workloads, the optimization may simply yield no gains, but it 
> certainly shouldn't make performance worse. We need to further refine 
> this logic.

Memory-intensive workloads may trigger performance regressions when
memory bandwidth (from the L3 cache to the memory controller) is
saturated due to task aggregation on a single LLC. We have seen this
issue in stream benchmark runs with the previous version.

Patch 23 introduces a debugfs knob, llc_aggr_tolerance, that lets
userspace tune the scale factor. This allows memory-intensive workloads
to perform task aggregation when their footprint is small and the
administrator considers it safe. As you noted in another patch,
fine-grained control would improve flexibility, and this can be
addressed in future iterations.

>> Note that RSS is only an approximation of the memory footprint.
>> By default, the comparison is strict, but a later patch will allow
>> users to provide a hint to adjust this threshold.
>>
>> According to the test from Adam, some systems do not have shared L3
>> but with shared L2 as clusters. In this case, the L2 becomes the LLC[1].
>>
>> Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739- 
>> b00e28a09cb6@os.amperecomputing.com/
>>
>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>
>> Notes:
>>      v1->v2: Assigned curr_cpu in task_cache_work() before checking
>>              exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound
>>              access.(lkp/0day)
>>
>>   include/linux/cacheinfo.h | 21 ++++++++++-------
>>   kernel/sched/fair.c       | 49 +++++++++++++++++++++++++++++++++++----
>>   2 files changed, 57 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
>> index c8f4f0a0b874..82d0d59ca0e1 100644
>> --- a/include/linux/cacheinfo.h
>> +++ b/include/linux/cacheinfo.h
>> @@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
>>   const struct attribute_group *cache_get_priv_group(struct cacheinfo 
>> *this_leaf);
>> -/*
>> - * Get the cacheinfo structure for the cache associated with @cpu at
>> - * level @level.
>> - * cpuhp lock must be held.
>> - */
>> -static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int 
>> level)
>> +static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int 
>> level)
>>   {
>>       struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
>>       int i;
>> -    lockdep_assert_cpus_held();
>> -
>>       for (i = 0; i < ci->num_leaves; i++) {
>>           if (ci->info_list[i].level == level) {
>>               if (ci->info_list[i].attributes & CACHE_ID)
>> @@ -136,6 +129,18 @@ static inline struct cacheinfo 
>> *get_cpu_cacheinfo_level(int cpu, int level)
>>       return NULL;
>>   }
>> +/*
>> + * Get the cacheinfo structure for the cache associated with @cpu at
>> + * level @level.
>> + * cpuhp lock must be held.
>> + */
>> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int 
>> level)
>> +{
>> +    lockdep_assert_cpus_held();
>> +
>> +    return _get_cpu_cacheinfo_level(cpu, level);
>> +}
>> +
>>   /*
>>    * Get the id of the cache associated with @cpu at level @level.
>>    * cpuhp lock must be held.
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6afa3f9a4e9b..424ec601cfdf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
>>       return llc;
>>   }
>> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>> +{
>> +    struct cacheinfo *ci;
>> +    unsigned long rss;
>> +    unsigned int llc;
>> +
>> +    /*
>> +     * get_cpu_cacheinfo_level() can not be used
>> +     * because it requires the cpu_hotplug_lock
>> +     * to be held. Use _get_cpu_cacheinfo_level()
>> +     * directly because the 'cpu' can not be
>> +     * offlined at the moment.
>> +     */
>> +    ci = _get_cpu_cacheinfo_level(cpu, 3);
>> +    if (!ci) {
>> +        /*
>> +         * On system without L3 but with shared L2,
>> +         * L2 becomes the LLC.
>> +         */
>> +        ci = _get_cpu_cacheinfo_level(cpu, 2);
>> +        if (!ci)
>> +            return true;
>> +    }
> Is there must call it one by one for get llc size? a static variable 
> instead in building sched domain?

I suppose you are suggesting introducing a per-CPU variable, like
per_cpu(sd_llc_bytes, cpu), or something similar to struct
cpuinfo_x86.x86_cache_size. I am not sure whether the community would
endorse introducing this variable, given that sched_cache would be its
only user. We can leave this as an open question.
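
If we did go that way, it might look roughly like the sketch below
(sd_llc_bytes and update_llc_bytes() are invented names for illustration,
not part of the posted series):

	DEFINE_PER_CPU(unsigned int, sd_llc_bytes);

	/* Called while (re)building sched domains, with cacheinfo stable. */
	static void update_llc_bytes(int cpu)
	{
		struct cacheinfo *ci = _get_cpu_cacheinfo_level(cpu, 3);

		/* No L3: fall back to a shared L2 acting as the LLC. */
		if (!ci)
			ci = _get_cpu_cacheinfo_level(cpu, 2);

		per_cpu(sd_llc_bytes, cpu) = ci ? ci->size : 0;
	}

exceed_llc_capacity() could then read per_cpu(sd_llc_bytes, cpu) instead
of walking the cacheinfo leaves on every invocation.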

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  2025-12-18  8:32     ` Chen, Yu C
@ 2025-12-18  9:42       ` Vern Hao
  2025-12-19  3:14         ` K Prateek Nayak
  0 siblings, 1 reply; 111+ messages in thread
From: Vern Hao @ 2025-12-18  9:42 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, K Prateek Nayak, Vincent Guittot,
	Gautham R . Shenoy, Ingo Molnar, Vern Hao


On 2025/12/18 16:32, Chen, Yu C wrote:
> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>
>> On 2025/12/4 07:07, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@intel.com>
>>>
>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>> when sched_cache aggregates too many threads.
>>>
>>> To mitigate this, estimate a process's memory footprint by comparing
>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>> exceeds the LLC size, skip cache-aware scheduling.
>> Restricting RSS prevents many applications from benefiting from this 
>> optimization. I believe this restriction should be lifted. For 
>> memory- intensive workloads, the optimization may simply yield no 
>> gains, but it certainly shouldn't make performance worse. We need to 
>> further refine this logic.
>
> Memory-intensive workloads may trigger performance regressions when
> memory bandwidth(from L3 cache to memory controller) is saturated due
RSS size and bandwidth saturation are not necessarily linked. In my
view, the optimization should be robust enough that it doesn't cause a
noticeable drop in performance, no matter how large the RSS is. We need
to have a more profound discussion on this.
> to task aggregation on single LLC. We have seen this issue in stream
> benchmark runs in previous version.
>
> Patch 23 introduces a debugfs knob llc_aggr_tolerance that lets userspace
> tune the scale factor. This allows memory-intensive workloads to perform
> task aggregation when their footprint is small and the administrator 
> considers
> it safe. As you noted in another patch, fine-grained control would 
> improve
> flexibility—and this can be addressed in future iterations.
>
>>> Note that RSS is only an approximation of the memory footprint.
>>> By default, the comparison is strict, but a later patch will allow
>>> users to provide a hint to adjust this threshold.
>>>
>>> According to the test from Adam, some systems do not have shared L3
>>> but with shared L2 as clusters. In this case, the L2 becomes the 
>>> LLC[1].
>>>
>>> Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739- 
>>> b00e28a09cb6@os.amperecomputing.com/
>>>
>>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> ---
>>>
>>> Notes:
>>>      v1->v2: Assigned curr_cpu in task_cache_work() before checking
>>>              exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound
>>>              access.(lkp/0day)
>>>
>>>   include/linux/cacheinfo.h | 21 ++++++++++-------
>>>   kernel/sched/fair.c       | 49 
>>> +++++++++++++++++++++++++++++++++++----
>>>   2 files changed, 57 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
>>> index c8f4f0a0b874..82d0d59ca0e1 100644
>>> --- a/include/linux/cacheinfo.h
>>> +++ b/include/linux/cacheinfo.h
>>> @@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
>>>   const struct attribute_group *cache_get_priv_group(struct 
>>> cacheinfo *this_leaf);
>>> -/*
>>> - * Get the cacheinfo structure for the cache associated with @cpu at
>>> - * level @level.
>>> - * cpuhp lock must be held.
>>> - */
>>> -static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, 
>>> int level)
>>> +static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, 
>>> int level)
>>>   {
>>>       struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
>>>       int i;
>>> -    lockdep_assert_cpus_held();
>>> -
>>>       for (i = 0; i < ci->num_leaves; i++) {
>>>           if (ci->info_list[i].level == level) {
>>>               if (ci->info_list[i].attributes & CACHE_ID)
>>> @@ -136,6 +129,18 @@ static inline struct cacheinfo 
>>> *get_cpu_cacheinfo_level(int cpu, int level)
>>>       return NULL;
>>>   }
>>> +/*
>>> + * Get the cacheinfo structure for the cache associated with @cpu at
>>> + * level @level.
>>> + * cpuhp lock must be held.
>>> + */
>>> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, 
>>> int level)
>>> +{
>>> +    lockdep_assert_cpus_held();
>>> +
>>> +    return _get_cpu_cacheinfo_level(cpu, level);
>>> +}
>>> +
>>>   /*
>>>    * Get the id of the cache associated with @cpu at level @level.
>>>    * cpuhp lock must be held.
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 6afa3f9a4e9b..424ec601cfdf 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
>>>       return llc;
>>>   }
>>> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> +{
>>> +    struct cacheinfo *ci;
>>> +    unsigned long rss;
>>> +    unsigned int llc;
>>> +
>>> +    /*
>>> +     * get_cpu_cacheinfo_level() can not be used
>>> +     * because it requires the cpu_hotplug_lock
>>> +     * to be held. Use _get_cpu_cacheinfo_level()
>>> +     * directly because the 'cpu' can not be
>>> +     * offlined at the moment.
>>> +     */
>>> +    ci = _get_cpu_cacheinfo_level(cpu, 3);
>>> +    if (!ci) {
>>> +        /*
>>> +         * On system without L3 but with shared L2,
>>> +         * L2 becomes the LLC.
>>> +         */
>>> +        ci = _get_cpu_cacheinfo_level(cpu, 2);
>>> +        if (!ci)
>>> +            return true;
>>> +    }
>> Is there must call it one by one for get llc size? a static variable 
>> instead in building sched domain?
>
> I suppose you suggested introducing a per-CPU variable, like 
> percpu(sd_llc_bytes, cpu),
> or something similar to struct cpuinfo_x86.x86_cache_size. I am not 
> sure if the community
> would endorse introducing this variable, given that sched_cache would 
> be its only user.
> We can leave this as an open question.
>
> thanks,
> Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  2025-12-18  9:42       ` Vern Hao
@ 2025-12-19  3:14         ` K Prateek Nayak
  2025-12-19 12:55           ` Chen, Yu C
  2025-12-22  2:19           ` Vern Hao
  0 siblings, 2 replies; 111+ messages in thread
From: K Prateek Nayak @ 2025-12-19  3:14 UTC (permalink / raw)
  To: Vern Hao, Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, Vincent Guittot, Gautham R . Shenoy, Ingo Molnar

Hello Vern,

On 12/18/2025 3:12 PM, Vern Hao wrote:
> 
> On 2025/12/18 16:32, Chen, Yu C wrote:
>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>
>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>
>>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>>> when sched_cache aggregates too many threads.
>>>>
>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>> exceeds the LLC size, skip cache-aware scheduling.
>>> Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory- intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
>>
>> Memory-intensive workloads may trigger performance regressions when
>> memory bandwidth(from L3 cache to memory controller) is saturated due
> RSS size and bandwidth saturation are not necessarily linked, In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.

Easier said than done. I agree RSS size is not a clear indication of
bandwidth saturation. With NUMA Balancing enabled, we can use the
hinting faults to estimate the working set and make decisions but for
systems that do not have NUMA, short of programming some performance
counters, there is no real way to estimate the working set.

Hinting faults are known to cause overheads so enabling them without
NUMA can cause noticeable overheads with no real benefits.

> We need to have a more profound discussion on this.

What do you have in mind?

From where I stand, having the RSS based bailout for now won't make
things worse for these tasks with huge memory reserves and when we can
all agree on some generic method to estimate the working set of a task,
we can always add it into exceed_llc_capacity().

-- 
Thanks and Regards,
Prateek

"Rome wasn't built in a day but they were laying bricks every hour.
 You don't have to build everything you want today, just lay a brick."

  - James Clear


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 00/23] Cache aware scheduling
  2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
                   ` (22 preceding siblings ...)
  2025-12-03 23:07 ` [PATCH v2 23/23] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Tim Chen
@ 2025-12-19  3:19 ` Aaron Lu
  2025-12-19 13:04   ` Chen, Yu C
  23 siblings, 1 reply; 111+ messages in thread
From: Aaron Lu @ 2025-12-19  3:19 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Tim Chen, linux-kernel

On Wed, Dec 03, 2025 at 03:07:19PM -0800, Tim Chen wrote:
... ...
> Test results:
> 
> The patch series was applied and tested on v6.18-rc7.
> See: https://github.com/timcchen1298/linux/commits/cache_aware_v2
> 
> The first test platform is a 2 socket Intel Sapphire Rapids with 30
> cores per socket. The DRAM interleaving is enabled in the BIOS so it
> essential has one NUMA node with two last level caches. There are 60
> CPUs associated with each last level cache.
> 
> The second test platform is a AMD Genoa. There are 4 Nodes and 32 CPUs
> per node. Each node has 2 CCXs and each CCX has 16 CPUs.
> 
> hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
> these two platforms.
> 
> [TL;DR]
> Sappire Rapids:
> hackbench shows significant improvement when the number of
> different active threads is below the capacity of a LLC.
> schbench shows overall wakeup latency improvement.
> ChaCha20-xiangshan shows good throughput improvement.
> 
> Genoa:
> ChaCha20-xiangshan shows huge throughput improvement.
> No obvious difference is observed in hackbench/schbench

I think for hackbench runs with a small number of tasks, there should be
some improvement.

I tried thread/pipe/2fds/1group, i.e. 4 tasks on Genoa:
./hackbench -T -f 2 -g 1 -p -l 2000000
And I noticed performance improved a lot:
(Result in seconds, less is better)

       llc_off       llc_on          diff
time   4.755±1.6%    2.684±6.25%    +43.6%

llc_off means /sys/kernel/debug/sched/llc_enabled set to 0 while
llc_on means /sys/kernel/debug/sched/llc_enabled set to 1; other
tunables are left unchanged.
Turbo is disabled and cpufreq is set to performance.

I also tried redis and noticed when I set io-threads to 4 in redis.conf,
there is also some improvement on AMD Genoa:

                 llc_off        manual      diff     llc_on      diff
throughput      1536727±0%     1737619±0%  +13.1%   1737720±0%  +13.1%

Client cmdline:
numactl -N 1 redis-benchmark --threads 4 -t set -r 100000 -P 16 -n 10000000
Server cmdline: numactl -N 0 redis-server ./redis.conf
I also tried to manually bind all tasks of the redis server to a single LLC
to see if this workload benefits from aggregation, and that's what 'manual'
means: taskset -c 8-15,200-207 redis-server ./redis.conf

According to the results, I think this 'cache aware scheduling' works
as expected, in that its performance is the same as manual binding; and
both beat llc_off.

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
  2025-12-17 12:51     ` Chen, Yu C
@ 2025-12-19  3:32       ` Aaron Lu
  0 siblings, 0 replies; 111+ messages in thread
From: Aaron Lu @ 2025-12-19  3:32 UTC (permalink / raw)
  To: Chen,  Yu C
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Tim Chen, linux-kernel, Tim Chen

On Wed, Dec 17, 2025 at 08:51:50PM +0800, Chen, Yu C wrote:
> On 12/17/2025 5:40 PM, Aaron Lu wrote:
> > On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
> > > @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
> > >   		mm->mm_sched_cpu = m_a_cpu;
> > >   	}
> > > +	update_avg(&mm->nr_running_avg, nr_running);
> > 
> > update_avg() doesn't appear to deal with small numbers well and can have
> > an error as large as 7, e.g. when nr_running < 8, nr_running_avg will
> > always be 0 and when nr_running >= 8 && < 16, nr_running_avg will be
> > 1 - 8, etc.
> > 
> > AMD Genoa has 8 cores per LLC and this will break exceed_llc_nr() there.
> > 
> 
> Ah, you are right, thanks for pointing this out; dividing by 8 would make
> convergence slow on small-LLC systems. Maybe consider the number of Cores

Not just slow but the error is too large for a small LLC.

> in the LLC: the smaller the number is, the more we should honor the diff
> between two invocations of update_avg()?
> 
> static inline void sched_cache_update_avg(u64 *avg, u64 sample)
> {
> 	s64 diff = sample - *avg;
> 	u32 divisor = clamp_t(u32, nr_cores_llc/4, 2, 8);
> 
> 	*avg += diff / divisor;
> }
> 
> For <=8 cores per LLC, the divisor is 2,
> for 16 cores per LLC, the divisor is 4,
> for >=32 cores per LLC, the divisor is 8

Yeah I guess it works. The error can be as large as 'divisor - 1' but
since this avg is an estimate, it may be OK.
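
For reference, a stand-alone user-space toy that shows how the steady-state
error tracks the divisor when the sample is constant (the sample value of 8
is just an illustrative thread count, and update_avg() below only mirrors
the shape of the kernel helper):

#include <stdio.h>

/* same shape as the kernel helper: avg += (sample - avg) / divisor */
static void update_avg(long *avg, long sample, long divisor)
{
	*avg += (sample - *avg) / divisor;
}

int main(void)
{
	long divisors[] = { 2, 4, 8 };

	for (int i = 0; i < 3; i++) {
		long avg = 0;

		/* feed a constant sample of 8 "running threads" */
		for (int j = 0; j < 100; j++)
			update_avg(&avg, 8, divisors[i]);

		/* avg settles at 7, 5 and 1: the error is at most divisor - 1 */
		printf("divisor %ld: avg settles at %ld (error %ld)\n",
		       divisors[i], avg, 8 - avg);
	}
	return 0;
}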

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes
  2025-12-15 19:32     ` Tim Chen
@ 2025-12-19  4:01       ` Vern Hao
  2025-12-24 10:20         ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Vern Hao @ 2025-12-19  4:01 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel, Vern Hao


On 2025/12/16 03:32, Tim Chen wrote:
> On Fri, 2025-12-12 at 11:34 +0800, Vern Hao wrote:
>> On 2025/12/4 07:07, Tim Chen wrote:
>>> With cache-aware scheduling enabled, each task is assigned a
>>> preferred LLC ID. This allows quick identification of the LLC domain
>>> where the task prefers to run, similar to numa_preferred_nid in
>>> NUMA balancing.
>>>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> ---
>>>
>>> Notes:
>>>       v1->v2: Align preferred LLC with NUMA balancing's preferred node.
>>>
>>>    include/linux/sched.h |  1 +
>>>    init/init_task.c      |  3 +++
>>>    kernel/sched/fair.c   | 18 ++++++++++++++++++
>>>    3 files changed, 22 insertions(+)
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 278b529c91df..1ad46220cd04 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1408,6 +1408,7 @@ struct task_struct {
>>>    
>>>    #ifdef CONFIG_SCHED_CACHE
>>>    	struct callback_head		cache_work;
>>> +	int				preferred_llc;
>>>    #endif
>>>    
>>>    #ifdef CONFIG_RSEQ
>>> diff --git a/init/init_task.c b/init/init_task.c
>>> index a55e2189206f..44bae72b5b7d 100644
>>> --- a/init/init_task.c
>>> +++ b/init/init_task.c
>>> @@ -191,6 +191,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
>>>    	.numa_group	= NULL,
>>>    	.numa_faults	= NULL,
>>>    #endif
>>> +#ifdef CONFIG_SCHED_CACHE
>>> +	.preferred_llc  = -1,
>>> +#endif
>>>    #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
>>>    	.kasan_depth	= 1,
>>>    #endif
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 0a3918269906..10cec83f65d5 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1300,6 +1300,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>>>    	struct mm_struct *mm = p->mm;
>>>    	struct mm_sched *pcpu_sched;
>>>    	unsigned long epoch;
>>> +	int mm_sched_llc = -1;
>>>    
>>>    	if (!sched_cache_enabled())
>>>    		return;
>>> @@ -1330,6 +1331,23 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>>>    		if (mm->mm_sched_cpu != -1)
>>>    			mm->mm_sched_cpu = -1;
>>>    	}
>>> +
>>> +	if (mm->mm_sched_cpu != -1) {
>>> +		mm_sched_llc = llc_id(mm->mm_sched_cpu);
>>> +
>>> +#ifdef CONFIG_NUMA_BALANCING
>>> +		/*
>>> +		 * Don't assign preferred LLC if it
>>> +		 * conflicts with NUMA balancing.
>>> +		 */
>>> +		if (p->numa_preferred_nid >= 0 &&
>> I wonder if the restriction here shouldn't be so strict. In Mel Gorman's
>> patch (e496132ebedd sched/fair: Adjust the allowed NUMA imbalance when
>> SD_NUMA spans multiple LLCs), the value of the 'imb_numa_nr' is checked
>> to determine if SD_NUMA imbalance is allowed. Could we use this same
>> check to decide whether or not to perform a cross-NUMA migration?
> If we set the preferred LLC that's in a different node other than the preferred
> node, the preferred LLC is going to fight with NUMA balancing and bounce
> tasks back and forth between nodes. NUMA locality is going to affect performance
> more so we'll let NUMA preference take precedence.

I might not have explained myself clearly. I'm questioning whether we
need to integrate an imbalance check into the 'sgs->group_type ==
group_has_spare' scenario, like Mel's patch, to refine our llc
migration decisions.

Just like this: 8 cpus in one LLC, LLC-A has 6 tasks and LLC-B has 2
tasks; if LLC-A has a task_a that needs to migrate to LLC-B, how should
we deal with it?
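
For illustration, a tiny stand-alone sketch of the kind of check I mean, in
the spirit of imb_numa_nr. llc_imb_nr is an assumed threshold used only for
this example; it is not a field from this series.

#include <stdio.h>
#include <stdbool.h>

/* tolerate a small task-count gap between two LLCs instead of migrating */
static bool llc_imbalance_tolerated(int busiest_nr, int local_nr, int llc_imb_nr)
{
	return busiest_nr - local_nr <= llc_imb_nr;
}

int main(void)
{
	/* the example above: 8 cpus per LLC, LLC-A has 6 tasks, LLC-B has 2 */
	int llc_a = 6, llc_b = 2, llc_imb_nr = 4;

	printf("tolerate the imbalance (keep task_a on LLC-A): %s\n",
	       llc_imbalance_tolerated(llc_a, llc_b, llc_imb_nr) ? "yes" : "no");
	return 0;
}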

> Tim
>
>>> +		    cpu_to_node(mm->mm_sched_cpu) != p->numa_preferred_nid)
>>> +			mm_sched_llc = -1;
>>> +#endif
>>> +	}
>>> +
>>> +	if (p->preferred_llc != mm_sched_llc)
>>> +		p->preferred_llc = mm_sched_llc;
>>>    }
>>>    
>>>    static void task_tick_cache(struct rq *rq, struct task_struct *p)

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-03 23:07 ` [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling Tim Chen
  2025-12-10 17:02   ` Peter Zijlstra
@ 2025-12-19  4:14   ` Vern Hao
  2025-12-19 13:21     ` Chen, Yu C
  2025-12-19 13:39     ` Chen, Yu C
  2025-12-23 12:12   ` Yangyu Chen
  2 siblings, 2 replies; 111+ messages in thread
From: Vern Hao @ 2025-12-19  4:14 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot
  Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Vern Hao


On 2025/12/4 07:07, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
>
> Introduce a set of debugfs knobs to control the enabling of
> and parameters for cache-aware load balancing.
>
> (1) llc_enabled
> llc_enabled acts as the primary switch - users can toggle it to
> enable or disable cache aware load balancing.
>
> (2) llc_aggr_tolerance
> With sched_cache enabled, the scheduler uses a process's RSS as a
> proxy for its LLC footprint to determine if aggregating tasks on the
> preferred LLC could cause cache contention. If RSS exceeds the LLC
> size, aggregation is skipped. Some workloads with large RSS but small
> actual memory footprints may still benefit from aggregation. Since
> the kernel cannot efficiently track per-task cache usage (resctrl is
> user-space only), userspace can provide a more accurate hint.
>
> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
> users control how strictly RSS limits aggregation. Values range from
> 0 to 100:
>
>    - 0: Cache-aware scheduling is disabled.
>    - 1: Strict; tasks with RSS larger than LLC size are skipped.
>    - 100: Aggressive; tasks are aggregated regardless of RSS.
>
> For example, with a 32MB L3 cache:
>
>    - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>    - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>      (784GB = (1 + (99 - 1) * 256) * 32MB).
>
> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
> how strictly the number of active threads is considered when doing
> cache aware load balance. The number of SMTs is also considered.
> High SMT counts reduce the aggregation capacity, preventing excessive
> task aggregation on SMT-heavy systems like Power10/Power11.
>
> For example, with 8 Cores/16 CPUs in a L3:
>
>    - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>    - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>      785 = (1 + (99 - 1) * 8).
>
> (3) llc_epoch_period/llc_epoch_affinity_timeout
> Besides, llc_epoch_period and llc_epoch_affinity_timeout are also turned
> into tunables.
>
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>
> Notes:
>      v1->v2: Remove the smt_nr check in fits_llc_capacity().
>              (Aaron Lu)
>
>   include/linux/sched.h   |  4 ++-
>   kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
>   kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
>   kernel/sched/sched.h    |  5 ++++
>   kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
>   5 files changed, 178 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 466ba8b7398c..95bf080bbbf0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>   DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>   
>   #ifdef CONFIG_SCHED_CACHE
> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
> +
>   static inline bool sched_cache_enabled(void)
>   {
> -	return false;
> +	return static_branch_unlikely(&sched_cache_on);
>   }
>   #endif
>   
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 02e16b70a790..cde324672103 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>   	.release	= single_release,
>   };
>   
> +#ifdef CONFIG_SCHED_CACHE
> +#define SCHED_CACHE_CREATE_CONTROL(name, max)			  \
> +static ssize_t sched_cache_write_##name(struct file *filp,	  \
> +					const char __user *ubuf,  \
> +					size_t cnt, loff_t *ppos) \
> +{								  \
> +	char buf[16];						  \
> +	unsigned int val;					  \
> +	if (cnt > 15)						  \
> +		cnt = 15;					  \
> +	if (copy_from_user(&buf, ubuf, cnt))			  \
> +		return -EFAULT;					  \
> +	buf[cnt] = '\0';					  \
> +	if (kstrtouint(buf, 10, &val))				  \
> +		return -EINVAL;					  \
> +	if (val > (max))						  \
> +		return -EINVAL;					  \
> +	llc_##name = val;					  \
> +	if (!strcmp(#name, "enabled"))				  \
> +		sched_cache_set(false);				  \
> +	*ppos += cnt;						  \
> +	return cnt;						  \
> +}								  \
> +static int sched_cache_show_##name(struct seq_file *m, void *v)	  \
> +{								  \
> +	seq_printf(m, "%d\n", llc_##name);			  \
> +	return 0;						  \
> +}								  \
> +static int sched_cache_open_##name(struct inode *inode,		  \
> +				   struct file *filp)		  \
> +{								  \
> +	return single_open(filp, sched_cache_show_##name, NULL);  \
> +}								  \
> +static const struct file_operations sched_cache_fops_##name = {	  \
> +	.open		= sched_cache_open_##name,		  \
> +	.write		= sched_cache_write_##name,		  \
> +	.read		= seq_read,				  \
> +	.llseek		= seq_lseek,				  \
> +	.release	= single_release,			  \
> +}
> +
> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
> +#endif /* SCHED_CACHE */
> +
>   static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
>   				   size_t cnt, loff_t *ppos)
>   {
> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
>   	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
>   #endif /* CONFIG_NUMA_BALANCING */
>   
> +#ifdef CONFIG_SCHED_CACHE
> +	debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
> +			    &sched_cache_fops_overload_pct);
> +	debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
> +			    &sched_cache_fops_imb_pct);
> +	debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
> +			    &sched_cache_fops_aggr_tolerance);
> +	debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
> +			    &sched_cache_fops_enabled);
> +	debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
> +			   &llc_epoch_period);
> +	debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
> +			   &llc_epoch_affinity_timeout);
> +#endif
> +
>   	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>   
>   	debugfs_fair_server_init();
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 424ec601cfdf..a2e2d6742481 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>   
>   __read_mostly unsigned int llc_overload_pct       = 50;
>   __read_mostly unsigned int llc_imb_pct            = 20;
> +__read_mostly unsigned int llc_aggr_tolerance     = 1;
> +__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
> +__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
>   
>   static int llc_id(int cpu)
>   {
> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
>   	return llc;
>   }
>   
> +static inline int get_sched_cache_scale(int mul)
> +{
> +	if (!llc_aggr_tolerance)
> +		return 0;
> +
> +	if (llc_aggr_tolerance == 100)
The range of llc_aggr_tolerance is [0, 100], so is there a little bug here?
Maybe check if (llc_aggr_tolerance >= 100).

And if llc_aggr_tolerance = 0, the function returns 0, which means
exceed_llc_capacity() & exceed_llc_nr() are always true; it may be
inconsistent to have this value set while llc_enabled=1 is set.

> +		return INT_MAX;
> +
> +	return (1 + (llc_aggr_tolerance - 1) * mul);
> +}
> +
>   static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>   {
> +	unsigned int llc, scale;
>   	struct cacheinfo *ci;
>   	unsigned long rss;
> -	unsigned int llc;
>   
>   	/*
>   	 * get_cpu_cacheinfo_level() can not be used
> @@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>   	rss = get_mm_counter(mm, MM_ANONPAGES) +
>   		get_mm_counter(mm, MM_SHMEMPAGES);
>   
> -	return (llc <= (rss * PAGE_SIZE));
> +	/*
> +	 * Scale the LLC size by 256*llc_aggr_tolerance
> +	 * and compare it to the task's RSS size.
> +	 *
> +	 * Suppose the L3 size is 32MB. If the
> +	 * llc_aggr_tolerance is 1:
> +	 * When the RSS is larger than 32MB, the process
> +	 * is regarded as exceeding the LLC capacity. If
> +	 * the llc_aggr_tolerance is 99:
> +	 * When the RSS is larger than 784GB, the process
> +	 * is regarded as exceeding the LLC capacity because:
> +	 * 784GB = (1 + (99 - 1) * 256) * 32MB
> +	 */
> +	scale = get_sched_cache_scale(256);
> +	if (scale == INT_MAX)
> +		return false;
> +
> +	return ((llc * scale) <= (rss * PAGE_SIZE));
>   }
>   
>   static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>   {
> -	int smt_nr = 1;
> +	int smt_nr = 1, scale;
>   
>   #ifdef CONFIG_SCHED_SMT
>   	if (sched_smt_active())
>   		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>   #endif
> +	/*
> +	 * Scale the Core number in a LLC by llc_aggr_tolerance
> +	 * and compare it to the task's active threads.
> +	 *
> +	 * Suppose the number of Cores in LLC is 8.
> +	 * Every core has 2 SMTs.
> +	 * If the llc_aggr_tolerance is 1: When the
> +	 * nr_running is larger than 8, the process
> +	 * is regarded as exceeding the LLC capacity.
> +	 * If the llc_aggr_tolerance is 99:
> +	 * When the nr_running is larger than 785,
> +	 * the process is regarded as exceeding
> +	 * the LLC capacity:
> +	 * 785 = 1 + (99 - 1) * 8
> +	 */
> +	scale = get_sched_cache_scale(1);
> +	if (scale == INT_MAX)
> +		return false;
>   
> -	return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
> +	return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
>   }
>   
>   static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
> @@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
>   	long delta = now - rq->cpu_epoch_next;
>   
>   	if (delta > 0) {
> -		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
> +		n = (delta + llc_epoch_period - 1) / llc_epoch_period;
>   		rq->cpu_epoch += n;
> -		rq->cpu_epoch_next += n * EPOCH_PERIOD;
> +		rq->cpu_epoch_next += n * llc_epoch_period;
>   		__shr_u64(&rq->cpu_runtime, n);
>   	}
>   
> @@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   	 * has only 1 thread, or has too many active threads, invalidate
>   	 * its preferred state.
>   	 */
> -	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
> +	if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
>   	    get_nr_threads(p) <= 1 ||
>   	    exceed_llc_nr(mm, cpu_of(rq)) ||
>   	    exceed_llc_capacity(mm, cpu_of(rq))) {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 40798a06e058..15d126bd3728 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
>   #ifdef CONFIG_SCHED_CACHE
>   extern unsigned int llc_overload_pct;
>   extern unsigned int llc_imb_pct;
> +extern unsigned int llc_aggr_tolerance;
> +extern unsigned int llc_epoch_period;
> +extern unsigned int llc_epoch_affinity_timeout;
> +extern unsigned int llc_enabled;
> +void sched_cache_set(bool locked);
>   #endif
>   
>   #ifdef CONFIG_SCHED_HRTICK
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9799e3a9a609..818599ddaaef 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -26,6 +26,49 @@ int max_llcs;
>   
>   static bool sched_cache_present;
>   
> +unsigned int llc_enabled = 1;
> +DEFINE_STATIC_KEY_FALSE(sched_cache_on);
> +
> +/*
> + * Enable/disable cache aware scheduling according to
> + * user input and the presence of hardware support.
> + */
> +static void _sched_cache_set(bool enable, bool locked)
> +{
> +	if (enable) {
> +		if (locked)
> +			static_branch_enable_cpuslocked(&sched_cache_on);
> +		else
> +			static_branch_enable(&sched_cache_on);
> +	} else {
> +		if (locked)
> +			static_branch_disable_cpuslocked(&sched_cache_on);
> +		else
> +			static_branch_disable(&sched_cache_on);
> +	}
> +}
> +
> +void sched_cache_set(bool locked)
> +{
> +	/* hardware does not support */
> +	if (!sched_cache_present) {
> +		if (static_branch_likely(&sched_cache_on))
> +			_sched_cache_set(false, locked);
> +
> +		return;
> +	}
> +
> +	/* user wants it or not ?*/
> +	if (llc_enabled) {
> +		if (!static_branch_likely(&sched_cache_on))
> +			_sched_cache_set(true, locked);
> +
> +	} else {
> +		if (static_branch_likely(&sched_cache_on))
> +			_sched_cache_set(false, locked);
> +	}
> +}
> +
>   static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
>   {
>   	unsigned int *new = NULL;
> @@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs)
>   	 * new buffer.
>   	 */
>   	tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
> -	if (!tmp_llc_pref)
> -		return -ENOMEM;
> +	if (!tmp_llc_pref) {
> +		sched_cache_present = false;
> +		ret = -ENOMEM;
> +
> +		goto out;
> +	}
>   
>   	for_each_present_cpu(i)
>   		*per_cpu_ptr(tmp_llc_pref, i) = NULL;
> @@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs)
>   		new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
>   		if (!new) {
>   			ret = -ENOMEM;
> +			sched_cache_present = false;
>   
>   			goto release_old;
>   		}
> @@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs)
>   	if (!ret)
>   		max_llcs = new_max_llcs;
>   
> +out:
> +	sched_cache_set(true);
>   	return ret;
>   }
>   

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing
  2025-12-03 23:07 ` [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing Tim Chen
@ 2025-12-19  5:03   ` Yangyu Chen
  2025-12-19 14:41     ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Yangyu Chen @ 2025-12-19  5:03 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel


> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> From: Chen Yu <yu.c.chen@intel.com>
> 
> Debug patch only.
> 
> With cache-aware load balancing enabled, statistics related to its activity
> are exposed via /proc/schedstat and debugfs. For instance, if users want to
> verify metrics like the number of exceeding RSS and nr_running limits, they
> can filter the output of /sys/kernel/debug/sched/debug and compute the required
> statistics manually:
> 
> llc_exceed_cap SUM: 6
> llc_exceed_nr SUM: 4531
> 
> Furthermore, these statistics exposed in /proc/schedstats can be queried manually
> or via perf sched stats[1] with minor modifications.
> 

Hi Tim,

This patch looks great, especially for multithreaded Verilator workloads
on clustered LLCs (like AMD EPYC). I'm discussing with Verilator
upstream disabling automatic userspace affinity assignment in
Verilator if such a feature exists [1]. During the discussion, I think
there should be a way for userspace software to detect whether such a
feature exists. Could we expose it in `/proc/schedstats` to allow
userspace software to detect it? We can just use this
patch and remove the "DO NOT APPLY" tag.

[1] https://github.com/verilator/verilator/issues/6826#issuecomment-3671287551

Thanks,
Yangyu Chen

> Link: https://lore.kernel.org/all/20250909114227.58802-1-swapnil.sapkal@amd.com #1
> 
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> include/linux/sched/topology.h | 1 +
> kernel/sched/fair.c            | 1 +
> kernel/sched/stats.c           | 5 +++--
> 3 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 0ba4697d74ba..8702c1e731a0 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -108,6 +108,7 @@ struct sched_domain {
> unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES];
> unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES];
> unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES];
> + unsigned int lb_imbalance_llc[CPU_MAX_IDLE_TYPES];
> unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
> unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
> unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a2e2d6742481..742e455b093e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12684,6 +12684,7 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
> __schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
> break;
> case migrate_llc_task:
> + __schedstat_add(sd->lb_imbalance_llc[idle], env->imbalance);
> break;
> }
> }
> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> index d1c9429a4ac5..3736f6102261 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -104,7 +104,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
>  * Bump this up when changing the output format or the meaning of an existing
>  * format, so that tools can adapt (or abort)
>  */
> -#define SCHEDSTAT_VERSION 17
> +#define SCHEDSTAT_VERSION 18
> 
> static int show_schedstat(struct seq_file *seq, void *v)
> {
> @@ -139,7 +139,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
> seq_printf(seq, "domain%d %s %*pb", dcount++, sd->name,
>   cpumask_pr_args(sched_domain_span(sd)));
> for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
> - seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
> + seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u",
>    sd->lb_count[itype],
>    sd->lb_balanced[itype],
>    sd->lb_failed[itype],
> @@ -147,6 +147,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
>    sd->lb_imbalance_util[itype],
>    sd->lb_imbalance_task[itype],
>    sd->lb_imbalance_misfit[itype],
> +    sd->lb_imbalance_llc[itype],
>    sd->lb_gained[itype],
>    sd->lb_hot_gained[itype],
>    sd->lb_nobusyq[itype],
> -- 
> 2.32.0


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  2025-12-19  3:14         ` K Prateek Nayak
@ 2025-12-19 12:55           ` Chen, Yu C
  2025-12-22  2:49             ` Vern Hao
  2025-12-22  2:19           ` Vern Hao
  1 sibling, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-19 12:55 UTC (permalink / raw)
  To: K Prateek Nayak, Vern Hao
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, Vincent Guittot, Gautham R . Shenoy, Ingo Molnar

On 12/19/2025 11:14 AM, K Prateek Nayak wrote:
> Hello Vern,
> 
> On 12/18/2025 3:12 PM, Vern Hao wrote:
>>
>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>>
>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>>
>>>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>>>> when sched_cache aggregates too many threads.
>>>>>
>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>> exceeds the LLC size, skip cache-aware scheduling.
>>>> Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory-intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
>>>
>>> Memory-intensive workloads may trigger performance regressions when
>>> memory bandwidth(from L3 cache to memory controller) is saturated due
>> RSS size and bandwidth saturation are not necessarily linked, In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.
> 
> Easier said than done. I agree RSS size is not a clear indication of
> bandwidth saturation. With NUMA Balancing enabled, we can use the
> hinting faults to estimate the working set and make decisions but for
> systems that do not have NUMA, short of programming some performance
> counters, there is no real way to estimate the working set.
> 
> Hinting faults are known to cause overheads so enabling them without
> NUMA can cause noticeable overheads with no real benefits.
> 
>> We need to have a more profound discussion on this.
> 
> What do you have in mind?
> 
>  From where I stand, having the RSS based bailout for now won't make
> things worse for these tasks with huge memory reserves and when we can
> all agree on some generic method to estimate the working set of a task,
> we can always add it into exceed_llc_capacity().
>

Prateek, thanks very much for the practical callouts - using RSS seems to be
the best trade-off we can go with for now. Vern, I get your point about the
gap between RSS and the actual memory footprint. However, detecting the
working set doesn't seem to be accurate or generic in kernel space - even
with NUMA fault statistics sampling. One reliable way I can think of to
detect the working set is in user space, via resctrl (Intel RDT, AMD QoS,
Arm MPAM). So maybe we can leverage that information to implement
fine-grained control on a per-process or per-task basis later.
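
For illustration, a minimal user-space sketch of the kind of resctrl reading
I have in mind. It assumes resctrl is mounted at /sys/fs/resctrl and that L3
occupancy monitoring is available; the mon_L3_00 domain name depends on the
cache topology, so the path may differ on other machines.

#include <stdio.h>

int main(void)
{
	/* assumed path: root monitoring group, first L3 domain */
	const char *path = "/sys/fs/resctrl/mon_data/mon_L3_00/llc_occupancy";
	unsigned long long bytes;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &bytes) != 1) {
		fprintf(stderr, "unexpected format in %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);

	printf("L3 occupancy of the root group: %llu bytes\n", bytes);
	return 0;
}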

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 00/23] Cache aware scheduling
  2025-12-19  3:19 ` [PATCH v2 00/23] Cache aware scheduling Aaron Lu
@ 2025-12-19 13:04   ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-19 13:04 UTC (permalink / raw)
  To: Aaron Lu, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Tim Chen, linux-kernel

On 12/19/2025 11:19 AM, Aaron Lu wrote:
> On Wed, Dec 03, 2025 at 03:07:19PM -0800, Tim Chen wrote:
> ... ...
>> Test results:
>>
>> The patch series was applied and tested on v6.18-rc7.
>> See: https://github.com/timcchen1298/linux/commits/cache_aware_v2
>>
>> The first test platform is a 2 socket Intel Sapphire Rapids with 30
>> cores per socket. The DRAM interleaving is enabled in the BIOS so it
>> essentially has one NUMA node with two last level caches. There are 60
>> CPUs associated with each last level cache.
>>
>> The second test platform is an AMD Genoa. There are 4 Nodes and 32 CPUs
>> per node. Each node has 2 CCXs and each CCX has 16 CPUs.
>>
>> hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on
>> these two platforms.
>>
>> [TL;DR]
>> Sapphire Rapids:
>> hackbench shows significant improvement when the number of
>> different active threads is below the capacity of a LLC.
>> schbench shows overall wakeup latency improvement.
>> ChaCha20-xiangshan shows good throughput improvement.
>>
>> Genoa:
>> ChaCha20-xiangshan shows huge throughput improvement.
>> No obvious difference is observed in hackbench/schbench
> 
> I think for small task number hackbench run, there should be some
> improvement.
> 
> I tried thread/pipe/2fds/1group, i.e. 4 tasks on Genoa:
> ./hackbench -T -f 2 -g 1 -p -l 2000000
> And I noticed performance improved a lot:
> (Result in seconds, less is better)
> 
>         llc_off       llc_on          diff
> time   4.755±1.6%    2.684±6.25%    +43.6%
> 
> llc_off means /sys/kernel/debug/sched/llc_enabled set to 0 while
> llc_on means /sys/kernel/debug/sched/llc_enabled set to 1, other
> tunables are left unchanged.
> Turbo is disabled and cpufreq set to performance.
> 
> I also tried redis and noticed when I set io-threads to 4 in redis.conf,
> there is also some improvement on AMD Genoa:
> 
>                   llc_off        manual      diff     llc_on      diff
> throughput      1536727±0%     1737619±0%  +13.1%   1737720±0%  +13.1%
> 
> Client cmdline:
> numactl -N 1 redis-benchmark --threads 4 -t set -r 100000 -P 16 -n 10000000
> server cmdline: numactl -N 0 redis-server ./redis.conf
> I also tried to manually bind all tasks of redis server to a single LLC
> to see if this workload benefits from aggregation and that's what manual
> means: taskset -c 8-15,200-207 redis-server ./redis.conf
> 
> According to the result, I think this 'cache aware scheduling' works
> as expected in that its performance is the same as manual binding; and
> they all beat llc_off.

Thanks Aaron for the test!

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-19  4:14   ` Vern Hao
@ 2025-12-19 13:21     ` Chen, Yu C
  2025-12-19 13:39     ` Chen, Yu C
  1 sibling, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-19 13:21 UTC (permalink / raw)
  To: Vern Hao
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, K Prateek Nayak, Ingo Molnar, Gautham R . Shenoy,
	Vincent Guittot

On 12/19/2025 12:14 PM, Vern Hao wrote:
> 
> On 2025/12/4 07:07, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> Introduce a set of debugfs knobs to control the enabling of
>> and parameters for cache-aware load balancing.
>>
>> (1) llc_enabled
>> llc_enabled acts as the primary switch - users can toggle it to
>> enable or disable cache aware load balancing.
>>
>> (2) llc_aggr_tolerance
>> With sched_cache enabled, the scheduler uses a process's RSS as a
>> proxy for its LLC footprint to determine if aggregating tasks on the
>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>> size, aggregation is skipped. Some workloads with large RSS but small
>> actual memory footprints may still benefit from aggregation. Since
>> the kernel cannot efficiently track per-task cache usage (resctrl is
>> user-space only), userspace can provide a more accurate hint.
>>
>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>> users control how strictly RSS limits aggregation. Values range from
>> 0 to 100:
>>
>>    - 0: Cache-aware scheduling is disabled.
>>    - 1: Strict; tasks with RSS larger than LLC size are skipped.
>>    - 100: Aggressive; tasks are aggregated regardless of RSS.
>>
>> For example, with a 32MB L3 cache:
>>
>>    - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>>    - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>>      (784GB = (1 + (99 - 1) * 256) * 32MB).
>>
>> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
>> how strictly the number of active threads is considered when doing
>> cache aware load balance. The number of SMTs is also considered.
>> High SMT counts reduce the aggregation capacity, preventing excessive
>> task aggregation on SMT-heavy systems like Power10/Power11.
>>
>> For example, with 8 Cores/16 CPUs in a L3:
>>
>>    - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>>    - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>>      785 = (1 + (99 - 1) * 8).
>>
>> (3) llc_epoch_period/llc_epoch_affinity_timeout
>> Besides, llc_epoch_period and llc_epoch_affinity_timeout are also turned
>> into tunables.
>>
>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
>> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> ---
>>
>> Notes:
>>      v1->v2: Remove the smt_nr check in fits_llc_capacity().
>>              (Aaron Lu)
>>
>>   include/linux/sched.h   |  4 ++-
>>   kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
>>   kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
>>   kernel/sched/sched.h    |  5 ++++
>>   kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
>>   5 files changed, 178 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 466ba8b7398c..95bf080bbbf0 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>>   DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>>   #ifdef CONFIG_SCHED_CACHE
>> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
>> +
>>   static inline bool sched_cache_enabled(void)
>>   {
>> -    return false;
>> +    return static_branch_unlikely(&sched_cache_on);
>>   }
>>   #endif
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 02e16b70a790..cde324672103 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -169,6 +169,53 @@ static const struct file_operations 
>> sched_feat_fops = {
>>       .release    = single_release,
>>   };
>> +#ifdef CONFIG_SCHED_CACHE
>> +#define SCHED_CACHE_CREATE_CONTROL(name, max)              \
>> +static ssize_t sched_cache_write_##name(struct file *filp,      \
>> +                    const char __user *ubuf,  \
>> +                    size_t cnt, loff_t *ppos) \
>> +{                                  \
>> +    char buf[16];                          \
>> +    unsigned int val;                      \
>> +    if (cnt > 15)                          \
>> +        cnt = 15;                      \
>> +    if (copy_from_user(&buf, ubuf, cnt))              \
>> +        return -EFAULT;                      \
>> +    buf[cnt] = '\0';                      \
>> +    if (kstrtouint(buf, 10, &val))                  \
>> +        return -EINVAL;                      \
>> +    if (val > (max))                          \
>> +        return -EINVAL;                      \
>> +    llc_##name = val;                      \
>> +    if (!strcmp(#name, "enabled"))                  \
>> +        sched_cache_set(false);                  \
>> +    *ppos += cnt;                          \
>> +    return cnt;                          \
>> +}                                  \
>> +static int sched_cache_show_##name(struct seq_file *m, void *v)      \
>> +{                                  \
>> +    seq_printf(m, "%d\n", llc_##name);              \
>> +    return 0;                          \
>> +}                                  \
>> +static int sched_cache_open_##name(struct inode *inode,          \
>> +                   struct file *filp)          \
>> +{                                  \
>> +    return single_open(filp, sched_cache_show_##name, NULL);  \
>> +}                                  \
>> +static const struct file_operations sched_cache_fops_##name = {      \
>> +    .open        = sched_cache_open_##name,          \
>> +    .write        = sched_cache_write_##name,          \
>> +    .read        = seq_read,                  \
>> +    .llseek        = seq_lseek,                  \
>> +    .release    = single_release,              \
>> +}
>> +
>> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
>> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
>> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
>> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
>> +#endif /* SCHED_CACHE */
>> +
>>   static ssize_t sched_scaling_write(struct file *filp, const char 
>> __user *ubuf,
>>                      size_t cnt, loff_t *ppos)
>>   {
>> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
>>       debugfs_create_u32("hot_threshold_ms", 0644, numa, 
>> &sysctl_numa_balancing_hot_threshold);
>>   #endif /* CONFIG_NUMA_BALANCING */
>> +#ifdef CONFIG_SCHED_CACHE
>> +    debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
>> +                &sched_cache_fops_overload_pct);
>> +    debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
>> +                &sched_cache_fops_imb_pct);
>> +    debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
>> +                &sched_cache_fops_aggr_tolerance);
>> +    debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
>> +                &sched_cache_fops_enabled);
>> +    debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
>> +               &llc_epoch_period);
>> +    debugfs_create_u32("llc_epoch_affinity_timeout", 0644, 
>> debugfs_sched,
>> +               &llc_epoch_affinity_timeout);
>> +#endif
>> +
>>       debugfs_create_file("debug", 0444, debugfs_sched, NULL, 
>> &sched_debug_fops);
>>       debugfs_fair_server_init();
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 424ec601cfdf..a2e2d6742481 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct 
>> sched_entity *se)
>>   __read_mostly unsigned int llc_overload_pct       = 50;
>>   __read_mostly unsigned int llc_imb_pct            = 20;
>> +__read_mostly unsigned int llc_aggr_tolerance     = 1;
>> +__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
>> +__read_mostly unsigned int llc_epoch_affinity_timeout = 
>> EPOCH_LLC_AFFINITY_TIMEOUT;
>>   static int llc_id(int cpu)
>>   {
>> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
>>       return llc;
>>   }
>> +static inline int get_sched_cache_scale(int mul)
>> +{
>> +    if (!llc_aggr_tolerance)
>> +        return 0;
>> +
>> +    if (llc_aggr_tolerance == 100)
> The range of llc_aggr_tolerance is [0, 100], so is there a little bug here?
> Maybe check if (llc_aggr_tolerance >= 100).

I thought llc_aggr_tolerance should not exceed 100: in
sched_cache_write_aggr_tolerance(), if the input value is
higher than max, it will return an error:
return -EINVAL;
I did a double check on this:

root@vm:/sys/kernel/debug/sched# echo 100 > llc_aggr_tolerance
root@vm:/sys/kernel/debug/sched# echo 101 > llc_aggr_tolerance
bash: echo: write error: Invalid argument

> 
> And if llc_aggr_tolerance = 0, the function returns 0, which means
> exceed_llc_capacity() & exceed_llc_nr() are always true; it may be
> inconsistent to have this value set while llc_enabled=1 is set.
> 

If llc_aggr_tolerance is 0, cache-aware scheduling is supposed
to be disabled - that is, exceed_llc_capacity() always returns true, so
the process is not eligible for cache-aware scheduling.
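
For reference, a small stand-alone sketch of the RSS thresholds that the
current scaling implies for an assumed 32MB L3. rss_scale() mirrors
get_sched_cache_scale(256) from the patch; the 802848 MB line is the ~784GB
example from the changelog.

#include <stdio.h>
#include <limits.h>

/* mirrors get_sched_cache_scale() with mul = 256 (the RSS side) */
static long rss_scale(unsigned int tolerance)
{
	if (!tolerance)
		return 0;		/* 0: aggregation never allowed */
	if (tolerance == 100)
		return LONG_MAX;	/* 100: RSS never limits aggregation */
	return 1 + (long)(tolerance - 1) * 256;
}

int main(void)
{
	const unsigned long llc_mb = 32;	/* assumed 32MB L3 */
	unsigned int tolerances[] = { 1, 50, 99 };

	for (int i = 0; i < 3; i++) {
		unsigned int t = tolerances[i];

		printf("llc_aggr_tolerance=%2u: skip aggregation if RSS > %lu MB\n",
		       t, llc_mb * rss_scale(t));
	}
	return 0;
}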

thanks,
Chenyu





^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-19  4:14   ` Vern Hao
  2025-12-19 13:21     ` Chen, Yu C
@ 2025-12-19 13:39     ` Chen, Yu C
  1 sibling, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-19 13:39 UTC (permalink / raw)
  To: Vern Hao
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, Gautham R . Shenoy, K Prateek Nayak,
	Vincent Guittot, Ingo Molnar

On 12/19/2025 12:14 PM, Vern Hao wrote:
> 
> On 2025/12/4 07:07, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>

> The range of llc_aggr_tolerance is [0, 100], so is there a little bug here?
> Maybe check if (llc_aggr_tolerance >= 100).
> 
> And if llc_aggr_tolerance = 0, the function returns 0, which means
> exceed_llc_capacity() & exceed_llc_nr() are always true; it may be
> inconsistent to have this value set while llc_enabled=1 is set.
> 

I see your point. The original idea was that llc_aggr_tolerance and
llc_enabled work together (independently) to determine whether cache-aware
scheduling should be enabled. That is to say, llc_enabled was not supposed
to be used as an indicator for users to query whether cache-aware
scheduling is actually enabled.
Let me check if we can reset llc_enabled if llc_aggr_tolerance is 0.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing
  2025-12-19  5:03   ` Yangyu Chen
@ 2025-12-19 14:41     ` Chen, Yu C
  2025-12-19 14:48       ` Yangyu Chen
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-19 14:41 UTC (permalink / raw)
  To: Yangyu Chen, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Tingyin Duan,
	Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel

On 12/19/2025 1:03 PM, Yangyu Chen wrote:
> 
>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> Debug patch only.
>>
>> With cache-aware load balancing enabled, statistics related to its activity
>> are exposed via /proc/schedstat and debugfs. For instance, if users want to
>> verify metrics like the number of times the RSS and nr_running limits are exceeded, they
>> can filter the output of /sys/kernel/debug/sched/debug and compute the required
>> statistics manually:
>>
>> llc_exceed_cap SUM: 6
>> llc_exceed_nr SUM: 4531
>>
>> Furthermore, these statistics exposed in /proc/schedstats can be queried manually
>> or via perf sched stats[1] with minor modifications.
>>
> 
> Hi Tim,
> 
> This patch looks great, especially for multithread Verilator workloads
> on clustered LLC (like AMD EPYC). I'm discussing with Verilator
> upstream to disable automatic userspace affinity assignment in
> Verilator if such a feature exists [1]. During the discussion, I think
> there should be a way for userspace software to detect if such a
> feature exists. Could we expose it in `/proc/schedstats` to allow
> userspace software to detect such a feature? We can just use this
> patch and remove the "DO NOT APPLY" tag.
> 

Thanks for the test, Yangyu. Does /sys/kernel/debug/sched/llc_enabled
work for you?
Anyway, we can try to include /proc/schedstats as the formal one
in the next version.

Thanks,
Chenyu

> [1] https://github.com/verilator/verilator/issues/6826#issuecomment-3671287551
> 
> Thanks,
> Yangyu Chen
> 

> 

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing
  2025-12-19 14:41     ` Chen, Yu C
@ 2025-12-19 14:48       ` Yangyu Chen
  0 siblings, 0 replies; 111+ messages in thread
From: Yangyu Chen @ 2025-12-19 14:48 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel



> On 19 Dec 2025, at 22:41, Chen, Yu C <yu.c.chen@intel.com> wrote:
> 
> On 12/19/2025 1:03 PM, Yangyu Chen wrote:
>>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>> 
>>> From: Chen Yu <yu.c.chen@intel.com>
>>> 
>>> Debug patch only.
>>> 
>>> With cache-aware load balancing enabled, statistics related to its activity
>>> are exposed via /proc/schedstat and debugfs. For instance, if users want to
>>> verify metrics like the number of times the RSS and nr_running limits are exceeded, they
>>> can filter the output of /sys/kernel/debug/sched/debug and compute the required
>>> statistics manually:
>>> 
>>> llc_exceed_cap SUM: 6
>>> llc_exceed_nr SUM: 4531
>>> 
>>> Furthermore, these statistics exposed in /proc/schedstats can be queried manually
>>> or via perf sched stats[1] with minor modifications.
>>> 
>> Hi Tim,
>> This patch looks great, especially for multithread Verilator workloads
>> on clustered LLC (like AMD EPYC). I'm discussing with Verilator
>> upstream to disable automatic userspace affinity assignment in
>> Verilator if such a feature exists [1]. During the discussion, I think
>> there should be a way for userspace software to detect if such a
>> feature exists. Could we expose it in `/proc/schedstats` to allow
>> userspace software to detect such a feature? We can just use this
>> patch and remove the "DO NOT APPLY" tag.
> 
> Thanks for the test Yangyu. Does /sys/kernel/debug/sched/llc_enabled
> work for you?

It requires debugfs to be mounted with sufficient permissions. That's not
feasible for normal user-space software without root permission.
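
For illustration, a minimal user-space sketch of the detection I have in
mind via /proc/schedstat. The comparison against version 18 assumes the
SCHEDSTAT_VERSION bump in this patch, which may change before anything is
merged.

#include <stdio.h>

int main(void)
{
	unsigned int version = 0;
	FILE *f = fopen("/proc/schedstat", "r");

	/* the first line of /proc/schedstat is "version <N>" */
	if (!f || fscanf(f, "version %u", &version) != 1) {
		if (f)
			fclose(f);
		fprintf(stderr, "cannot read schedstat version\n");
		return 1;
	}
	fclose(f);

	printf("schedstat version %u%s\n", version,
	       version >= 18 ? " (llc imbalance field may be present)" : "");
	return 0;
}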

Thanks,
Yangyu Chen

> Anyway we can try to include /proc/schedstats as the formal one
> in the next version.
> 
> Thanks,
> Chenyu
> 
>> [1] https://github.com/verilator/verilator/issues/6826#issuecomment-3671287551
>> Thanks,
>> Yangyu Chen



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  2025-12-19  3:14         ` K Prateek Nayak
  2025-12-19 12:55           ` Chen, Yu C
@ 2025-12-22  2:19           ` Vern Hao
  1 sibling, 0 replies; 111+ messages in thread
From: Vern Hao @ 2025-12-22  2:19 UTC (permalink / raw)
  To: K Prateek Nayak, Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, Vincent Guittot, Gautham R . Shenoy, Ingo Molnar,
	Vern Hao


On 2025/12/19 11:14, K Prateek Nayak wrote:
> Hello Vern,
>
> On 12/18/2025 3:12 PM, Vern Hao wrote:
>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>>
>>>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>>>> when sched_cache aggregates too many threads.
>>>>>
>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>> exceeds the LLC size, skip cache-aware scheduling.
>>>> Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory-intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
>>> Memory-intensive workloads may trigger performance regressions when
>>> memory bandwidth(from L3 cache to memory controller) is saturated due
>> RSS size and bandwidth saturation are not necessarily linked, In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.
> Easier said than done. I agree RSS size is not a clear indication of
> bandwidth saturation. With NUMA Balancing enabled, we can use the
> hinting faults to estimate the working set and make decisions but for
> systems that do not have NUMA, short of programming some performance
> counters, there is no real way to estimate the working set.
I see the challenge, but the reality is that many production workloads
have large memory footprints and deserve to see performance gains as
well. In my testing with Chen Yu on STREAM, it's intriguing that the
performance is fine without llc_enabled but drops significantly once
it's turned on. I sincerely hope this situation can be optimized;
otherwise, we won't be able to utilize these optimizations in
large-memory scenarios.
>
> Hinting faults are known to cause overheads so enabling them without
> NUMA can cause noticeable overheads with no real benefits.
>
>> We need to have a more profound discussion on this.
> What do you have in mind?
I am wondering if we could address this through alternative approaches, 
such as reducing the migration frequency or preventing excessive task 
stacking within a single LLC. Of course, defining the right metrics to 
evaluate these conditions remains a significant challenge.
>
>  From where I stand, having the RSS based bailout for now won't make
> things worse for these tasks with huge memory reserves and when we can
> all agree on some generic method to estimate the working set of a task,
> we can always add it into exceed_llc_capacity().
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
  2025-12-19 12:55           ` Chen, Yu C
@ 2025-12-22  2:49             ` Vern Hao
  0 siblings, 0 replies; 111+ messages in thread
From: Vern Hao @ 2025-12-22  2:49 UTC (permalink / raw)
  To: Chen, Yu C, K Prateek Nayak
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, Vincent Guittot, Gautham R . Shenoy, Ingo Molnar,
	Vern Hao


On 2025/12/19 20:55, Chen, Yu C wrote:
> On 12/19/2025 11:14 AM, K Prateek Nayak wrote:
>> Hello Vern,
>>
>> On 12/18/2025 3:12 PM, Vern Hao wrote:
>>>
>>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>>>
>>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>>>
>>>>>> Prateek and Tingyin reported that memory-intensive workloads 
>>>>>> (such as
>>>>>> stream) can saturate memory bandwidth and caches on the preferred 
>>>>>> LLC
>>>>>> when sched_cache aggregates too many threads.
>>>>>>
>>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>>> exceeds the LLC size, skip cache-aware scheduling.
>>>>> Restricting RSS prevents many applications from benefiting from 
>>>>> this optimization. I believe this restriction should be lifted. 
>>>>> For memory-intensive workloads, the optimization may simply yield
>>>>> no gains, but it certainly shouldn't make performance worse. We 
>>>>> need to further refine this logic.
>>>>
>>>> Memory-intensive workloads may trigger performance regressions when
>>>> memory bandwidth(from L3 cache to memory controller) is saturated due
>>> RSS size and bandwidth saturation are not necessarily linked, In my 
>>> view, the optimization should be robust enough that it doesn't cause 
>>> a noticeable drop in performance, no matter how large the RSS is.
>>
>> Easier said than done. I agree RSS size is not a clear indication of
>> bandwidth saturation. With NUMA Balancing enabled, we can use the
>> hinting faults to estimate the working set and make decisions but for
>> systems that do not have NUMA, short of programming some performance
>> counters, there is no real way to estimate the working set.
>>
>> Hinting faults are known to cause overheads so enabling them without
>> NUMA can cause noticeable overheads with no real benefits.
>>
>>> We need to have a more profound discussion on this.
>>
>> What do you have in mind?
>>
>>  From where I stand, having the RSS based bailout for now won't make
>> things worse for these tasks with huge memory reserves and when we can
>> all agree on some generic method to estimate the working set of a task,
>> we can always add it into exceed_llc_capacity().
>>
>
> Prateek, thanks very much for the practical callouts - using RSS seems
> to be the best trade-off we can go with for now. Vern, I get your point
> about the concern between RSS and the actual memory footprint. However,
> detecting the working set doesn't seem to be accurate or generic in
> kernel space - even with NUMA fault statistics sampling. One reliable
> way I can think of to detect the working set is in user space, via
> resctrl (Intel RDT, AMD QoS, Arm MPAM). So maybe we can leverage that
> information to implement fine-grained control on a per-process or
> per-task basis later.
OK, I agree, thanks.
>
> thanks,
> Chenyu
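
For reference, the user-space side of that resctrl idea could look
roughly like the sketch below. It assumes resctrl is mounted at
/sys/fs/resctrl, llc_occupancy monitoring (CMT) is supported, and the
task of interest is already in a monitoring group named "grp0"; the
L3 domain directory name (mon_L3_00 here) varies by system.

#include <stdio.h>

int main(void)
{
	/* Group and domain names are assumptions; adjust to the local setup. */
	const char *path =
		"/sys/fs/resctrl/mon_groups/grp0/mon_data/mon_L3_00/llc_occupancy";
	unsigned long long occ;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &occ) != 1) {
		fprintf(stderr, "failed to parse %s\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);

	/* llc_occupancy is reported in bytes; compare against the L3 size. */
	printf("LLC occupancy: %llu bytes\n", occ);
	return 0;
}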

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-03 23:07 ` [PATCH v2 04/23] sched/cache: Make LLC id continuous Tim Chen
  2025-12-09 11:58   ` Peter Zijlstra
@ 2025-12-23  5:31   ` K Prateek Nayak
  2025-12-24  7:08     ` Chen, Yu C
  1 sibling, 1 reply; 111+ messages in thread
From: K Prateek Nayak @ 2025-12-23  5:31 UTC (permalink / raw)
  To: Tim Chen, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy,
	Vincent Guittot
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel

Hello Tim, Chenyu,

On 12/4/2025 4:37 AM, Tim Chen wrote:
> +/*
> + * Assign continuous llc id for the CPU, and return
> + * the assigned llc id.
> + */
> +static int update_llc_id(struct sched_domain *sd,
> +			 int cpu)
> +{
> +	int id = per_cpu(sd_llc_id, cpu), i;
> +
> +	if (id >= 0)
> +		return id;
> +
> +	if (sd) {
> +		/* Look for any assigned id and reuse it.*/
> +		for_each_cpu(i, sched_domain_span(sd)) {
> +			id = per_cpu(sd_llc_id, i);
> +
> +			if (id >= 0) {
> +				per_cpu(sd_llc_id, cpu) = id;
> +				return id;
> +			}
> +		}
> +	}

I don't really like tying this down to the sched_domain span, since
partitioning and other weirdness can cause the max_llc count to go
unnecessarily high. The tl->mask() (from sched_domain_topology_level)
should give the mask covering all online CPUs without bothering
about cpusets.

How about something like:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b17d8e3cb55..c19b1c4e6472 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8270,6 +8270,18 @@ static void cpuset_cpu_active(void)
 static void cpuset_cpu_inactive(unsigned int cpu)
 {
 	if (!cpuhp_tasks_frozen) {
+		/*
+		 * This is necessary since offline CPUs are
+		 * taken out of the tl->mask() and a newly
+		 * onlined CPU in same LLC will not realize
+		 * whether it should reuse the LLC ID owned
+		 * by an offline CPU without knowing the
+		 * LLC association.
+		 *
+		 * Safe to release the reference if this is
+		 * the last CPU in the LLC going offline.
+		 */
+		sched_domain_free_llc_id(cpu);
 		cpuset_update_active_cpus();
 	} else {
 		num_cpus_frozen++;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..1378a1cfad18 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -631,6 +631,7 @@ void update_sched_domain_debugfs(void)
 			i++;
 		}
 
+		debugfs_create_u32("llc_id", 0444, d_cpu, (u32 *)per_cpu_ptr(&sd_llc_id, cpu));
 		__cpumask_clear_cpu(cpu, sd_sysctl_cpus);
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ceaa9dc9a9e..69fad88b57d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2142,6 +2142,7 @@ extern int group_balance_cpu(struct sched_group *sg);
 
 extern void update_sched_domain_debugfs(void);
 extern void dirty_sched_domain_sysctl(int cpu);
+void sched_domain_free_llc_id(int cpu);
 
 extern int sched_update_scaling(void);
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..d6e134767f30 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -20,6 +20,46 @@ void sched_domains_mutex_unlock(void)
 /* Protected by sched_domains_mutex: */
 static cpumask_var_t sched_domains_tmpmask;
 static cpumask_var_t sched_domains_tmpmask2;
+static cpumask_var_t sched_llc_id_alloc_mask;
+DEFINE_PER_CPU(int, sd_llc_id) = -1;
+static int max_llcs = 0;
+
+static inline int sched_domain_alloc_llc_id(void)
+{
+	int llc_id;
+
+	lockdep_assert_held(&sched_domains_mutex);
+
+	llc_id = cpumask_first_zero(sched_llc_id_alloc_mask);
+	BUG_ON((unsigned int)llc_id >= nr_cpumask_bits);
+	cpumask_set_cpu(llc_id, sched_llc_id_alloc_mask);
+	++max_llcs;
+
+	return llc_id;
+}
+
+void sched_domain_free_llc_id(int cpu)
+{
+	int i, llc_id = per_cpu(sd_llc_id, cpu);
+	bool found = false;
+
+	lockdep_assert_cpus_held(); /* For cpu_active_mask. */
+	guard(mutex)(&sched_domains_mutex);
+
+	per_cpu(sd_llc_id, cpu) = -1;
+	for_each_cpu(i, cpu_active_mask) {
+		if (per_cpu(sd_llc_id, i) == llc_id) {
+			found = true;
+			break;
+		}
+	}
+
+	/* Allow future hotplugs to claim this ID */
+	if (!found) {
+		cpumask_clear_cpu(llc_id, sched_llc_id_alloc_mask);
+		--max_llcs;
+	}
+}
 
 static int __init sched_debug_setup(char *str)
 {
@@ -658,7 +698,6 @@ static void destroy_sched_domains(struct sched_domain *sd)
  */
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
-DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(int, sd_share_id);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
@@ -684,7 +723,6 @@ static void update_top_cache_domain(int cpu)
 
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
-	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
 	sd = lowest_flag_domain(cpu, SD_CLUSTER);
@@ -2567,10 +2605,35 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 	/* Set up domains for CPUs specified by the cpu_map: */
 	for_each_cpu(i, cpu_map) {
-		struct sched_domain_topology_level *tl;
+		struct sched_domain_topology_level *tl, *tl_llc = NULL;
+		bool done = false;
 
 		sd = NULL;
 		for_each_sd_topology(tl) {
+			int flags = 0;
+
+			if (tl->sd_flags)
+				flags = (*tl->sd_flags)();
+
+			if (flags & SD_SHARE_LLC) {
+				tl_llc = tl;
+
+				/*
+				 * Entire cpu_map has been covered. We are
+				 * traversing only to find the highest
+				 * SD_SHARE_LLC level.
+				 */
+				if (done)
+					continue;
+			}
+
+			/*
+			 * Since SD_SHARE_LLC is SDF_SHARED_CHILD, we can
+			 * safely break out if the entire cpu_map has been
+			 * covered by a child domain.
+			 */
+			if (done)
+				break;
 
 			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 
@@ -2579,7 +2642,41 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 			if (tl == sched_domain_topology)
 				*per_cpu_ptr(d.sd, i) = sd;
 			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
-				break;
+				done = true;
+		}
+
+		/* First time visiting this CPU. Assign the llc_id. */
+		if (per_cpu(sd_llc_id, i) == -1) {
+			int j, llc_id = -1;
+
+			/*
+			 * In case there are no SD_SHARE_LLC domains,
+			 * each CPU gets its own llc_id. Find the first
+			 * free bit on the mask and use it.
+			 */
+			if (!tl_llc) {
+				per_cpu(sd_llc_id, i) = sched_domain_alloc_llc_id();
+				continue;
+			}
+
+			/*
+			 * Visit all the CPUs of the LLC irrespective of the
+			 * partition constraints and find if any of them have
+			 * a valid llc_id.
+			 */
+			for_each_cpu(j, tl_llc->mask(tl, i)) {
+				llc_id = per_cpu(sd_llc_id, j);
+
+				/* Found a valid llc_id for CPU's LLC. */
+				if (llc_id != -1)
+					break;
+			}
+
+			/* Valid llc_id not found. Allocate a new one. */
+			if (llc_id == -1)
+				llc_id = sched_domain_alloc_llc_id();
+
+			per_cpu(sd_llc_id, i) = llc_id;
 		}
 	}
 
@@ -2759,6 +2856,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
 
 	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
 	zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
+	zalloc_cpumask_var(&sched_llc_id_alloc_mask, GFP_KERNEL);
 	zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
 
 	arch_update_cpu_topology();
---

AFAICT, "sd_llc_id" isn't compared across different partitions so having
the CPUs that are actually associated with same physical LLC but across
different partitions sharing the same "sd_llc_id" shouldn't be a problem.

Thoughts?

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-03 23:07 ` [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling Tim Chen
  2025-12-10 17:02   ` Peter Zijlstra
  2025-12-19  4:14   ` Vern Hao
@ 2025-12-23 12:12   ` Yangyu Chen
  2025-12-23 16:44     ` Yangyu Chen
  2 siblings, 1 reply; 111+ messages in thread
From: Yangyu Chen @ 2025-12-23 12:12 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel



> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> From: Chen Yu <yu.c.chen@intel.com>
> 
> Introduce a set of debugfs knobs to control the enabling of
> and parameters for cache-aware load balancing.
> 
> (1) llc_enabled
> llc_enabled acts as the primary switch - users can toggle it to
> enable or disable cache aware load balancing.
> 
> (2) llc_aggr_tolerance
> With sched_cache enabled, the scheduler uses a process's RSS as a
> proxy for its LLC footprint to determine if aggregating tasks on the
> preferred LLC could cause cache contention. If RSS exceeds the LLC
> size, aggregation is skipped. Some workloads with large RSS but small
> actual memory footprints may still benefit from aggregation. Since
> the kernel cannot efficiently track per-task cache usage (resctrl is
> user-space only), userspace can provide a more accurate hint.
> 
> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
> users control how strictly RSS limits aggregation. Values range from
> 0 to 100:
> 
>  - 0: Cache-aware scheduling is disabled.
>  - 1: Strict; tasks with RSS larger than LLC size are skipped.
>  - 100: Aggressive; tasks are aggregated regardless of RSS.
> 

Hi Chen Yu and Tim Chen,

Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).

I have tested this version of the patch on my EPYC Milan 7V13 (7763
variant) physical machine, which has a 32MB LLC for each 8-core CCX. I
found that I need to tune "llc_aggr_tolerance" to 100, otherwise I
can't get cache-aware scheduling to work on Verilated [1] XiangShan [2]
running the chacha20 [3] workload, as I mentioned before [4].

But if I set it to 100, I will lose some performance on stream copy
benchmarks since the bandwidth is limited per CCX. Thus, I think we
should have a new prctl to let userspace software hint to the kernel
that this task is bound by inter-core latency and should use this
feature regardless of whether its RSS exceeds the LLC size.

I finally have an EPYC Milan at home today, so I'm glad to test this
patch further. I would very much like to have this so we can get a
fast Verilator [1] run without manual numactl pinning.

[1] https://github.com/verilator/verilator
[2] https://github.com/OpenXiangShan/Xiangshan
[3] https://github.com/cyyself/chacha20-xiangshan
[4] https://lore.kernel.org/lkml/tencent_6E51A3175F8AE0A7F684A319EE63CC56C806@qq.com/
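
A rough sketch of how such a per-process hint could look from user
space. PR_LLC_AGGR_TOLERANCE is a made-up constant for illustration
only; no such prctl() command exists in this series, so the call below
is expected to fail with EINVAL on current kernels.

#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/prctl.h>

#ifndef PR_LLC_AGGR_TOLERANCE
#define PR_LLC_AGGR_TOLERANCE 0x4c4c4321	/* hypothetical, unassigned */
#endif

int main(void)
{
	/* Ask for aggressive LLC aggregation for this process only. */
	if (prctl(PR_LLC_AGGR_TOLERANCE, 100, 0, 0, 0) == -1)
		fprintf(stderr, "prctl: %s (expected until such a command exists)\n",
			strerror(errno));
	return 0;
}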

Thanks,
Yangyu Chen

> For example, with a 32MB L3 cache:
> 
>  - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>  - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>    (784GB = (1 + (99 - 1) * 256) * 32MB).
> 
> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
> how strictly the number of active threads is considered when doing
> cache aware load balance. The number of SMTs is also considered.
> High SMT counts reduce the aggregation capacity, preventing excessive
> task aggregation on SMT-heavy systems like Power10/Power11.
> 
> For example, with 8 Cores/16 CPUs in a L3:
> 
>  - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>  - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>    785 = (1 + (99 - 1) * 8).
> 
> (3) llc_epoch_period/llc_epoch_affinity_timeout
> Besides, llc_epoch_period and llc_epoch_affinity_timeout are also turned
> into tunable.
> 
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> 
> Notes:
>    v1->v2: Remove the smt_nr check in fits_llc_capacity().
>            (Aaron Lu)
> 
> include/linux/sched.h   |  4 ++-
> kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
> kernel/sched/sched.h    |  5 ++++
> kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
> 5 files changed, 178 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 466ba8b7398c..95bf080bbbf0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
> DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
> 
> #ifdef CONFIG_SCHED_CACHE
> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
> +
> static inline bool sched_cache_enabled(void)
> {
> - return false;
> + return static_branch_unlikely(&sched_cache_on);
> }
> #endif
> 
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 02e16b70a790..cde324672103 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
> .release = single_release,
> };
> 
> +#ifdef CONFIG_SCHED_CACHE
> +#define SCHED_CACHE_CREATE_CONTROL(name, max)  \
> +static ssize_t sched_cache_write_##name(struct file *filp,  \
> + const char __user *ubuf,  \
> + size_t cnt, loff_t *ppos) \
> +{  \
> + char buf[16];  \
> + unsigned int val;  \
> + if (cnt > 15)  \
> + cnt = 15;  \
> + if (copy_from_user(&buf, ubuf, cnt))  \
> + return -EFAULT;  \
> + buf[cnt] = '\0';  \
> + if (kstrtouint(buf, 10, &val))  \
> + return -EINVAL;  \
> + if (val > (max))  \
> + return -EINVAL;  \
> + llc_##name = val;  \
> + if (!strcmp(#name, "enabled"))  \
> + sched_cache_set(false);  \
> + *ppos += cnt;  \
> + return cnt;  \
> +}  \
> +static int sched_cache_show_##name(struct seq_file *m, void *v)  \
> +{  \
> + seq_printf(m, "%d\n", llc_##name);  \
> + return 0;  \
> +}  \
> +static int sched_cache_open_##name(struct inode *inode,  \
> +   struct file *filp)  \
> +{  \
> + return single_open(filp, sched_cache_show_##name, NULL);  \
> +}  \
> +static const struct file_operations sched_cache_fops_##name = {  \
> + .open = sched_cache_open_##name,  \
> + .write = sched_cache_write_##name,  \
> + .read = seq_read,  \
> + .llseek = seq_lseek,  \
> + .release = single_release,  \
> +}
> +
> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
> +#endif /* SCHED_CACHE */
> +
> static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
>   size_t cnt, loff_t *ppos)
> {
> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
> debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
> #endif /* CONFIG_NUMA_BALANCING */
> 
> +#ifdef CONFIG_SCHED_CACHE
> + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
> +    &sched_cache_fops_overload_pct);
> + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
> +    &sched_cache_fops_imb_pct);
> + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
> +    &sched_cache_fops_aggr_tolerance);
> + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
> +    &sched_cache_fops_enabled);
> + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
> +   &llc_epoch_period);
> + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
> +   &llc_epoch_affinity_timeout);
> +#endif
> +
> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
> 
> debugfs_fair_server_init();
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 424ec601cfdf..a2e2d6742481 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
> 
> __read_mostly unsigned int llc_overload_pct       = 50;
> __read_mostly unsigned int llc_imb_pct            = 20;
> +__read_mostly unsigned int llc_aggr_tolerance     = 1;
> +__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
> +__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
> 
> static int llc_id(int cpu)
> {
> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
> return llc;
> }
> 
> +static inline int get_sched_cache_scale(int mul)
> +{
> + if (!llc_aggr_tolerance)
> + return 0;
> +
> + if (llc_aggr_tolerance == 100)
> + return INT_MAX;
> +
> + return (1 + (llc_aggr_tolerance - 1) * mul);
> +}
> +
> static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
> {
> + unsigned int llc, scale;
> struct cacheinfo *ci;
> unsigned long rss;
> - unsigned int llc;
> 
> /*
> * get_cpu_cacheinfo_level() can not be used
> @@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
> rss = get_mm_counter(mm, MM_ANONPAGES) +
> get_mm_counter(mm, MM_SHMEMPAGES);
> 
> - return (llc <= (rss * PAGE_SIZE));
> + /*
> + * Scale the LLC size by 256*llc_aggr_tolerance
> + * and compare it to the task's RSS size.
> + *
> + * Suppose the L3 size is 32MB. If the
> + * llc_aggr_tolerance is 1:
> + * When the RSS is larger than 32MB, the process
> + * is regarded as exceeding the LLC capacity. If
> + * the llc_aggr_tolerance is 99:
> + * When the RSS is larger than 784GB, the process
> + * is regarded as exceeding the LLC capacity because:
> + * 784GB = (1 + (99 - 1) * 256) * 32MB
> + */
> + scale = get_sched_cache_scale(256);
> + if (scale == INT_MAX)
> + return false;
> +
> + return ((llc * scale) <= (rss * PAGE_SIZE));
> }
> 
> static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
> {
> - int smt_nr = 1;
> + int smt_nr = 1, scale;
> 
> #ifdef CONFIG_SCHED_SMT
> if (sched_smt_active())
> smt_nr = cpumask_weight(cpu_smt_mask(cpu));
> #endif
> + /*
> + * Scale the Core number in a LLC by llc_aggr_tolerance
> + * and compare it to the task's active threads.
> + *
> + * Suppose the number of Cores in LLC is 8.
> + * Every core has 2 SMTs.
> + * If the llc_aggr_tolerance is 1: When the
> + * nr_running is larger than 8, the process
> + * is regarded as exceeding the LLC capacity.
> + * If the llc_aggr_tolerance is 99:
> + * When the nr_running is larger than 785,
> + * the process is regarded as exceeding
> + * the LLC capacity:
> + * 785 = 1 + (99 - 1) * 8
> + */
> + scale = get_sched_cache_scale(1);
> + if (scale == INT_MAX)
> + return false;
> 
> - return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
> + return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
> }
> 
> static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
> @@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
> long delta = now - rq->cpu_epoch_next;
> 
> if (delta > 0) {
> - n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
> + n = (delta + llc_epoch_period - 1) / llc_epoch_period;
> rq->cpu_epoch += n;
> - rq->cpu_epoch_next += n * EPOCH_PERIOD;
> + rq->cpu_epoch_next += n * llc_epoch_period;
> __shr_u64(&rq->cpu_runtime, n);
> }
> 
> @@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> * has only 1 thread, or has too many active threads, invalidate
> * its preferred state.
> */
> - if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
> + if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
>    get_nr_threads(p) <= 1 ||
>    exceed_llc_nr(mm, cpu_of(rq)) ||
>    exceed_llc_capacity(mm, cpu_of(rq))) {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 40798a06e058..15d126bd3728 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
> #ifdef CONFIG_SCHED_CACHE
> extern unsigned int llc_overload_pct;
> extern unsigned int llc_imb_pct;
> +extern unsigned int llc_aggr_tolerance;
> +extern unsigned int llc_epoch_period;
> +extern unsigned int llc_epoch_affinity_timeout;
> +extern unsigned int llc_enabled;
> +void sched_cache_set(bool locked);
> #endif
> 
> #ifdef CONFIG_SCHED_HRTICK
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 9799e3a9a609..818599ddaaef 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -26,6 +26,49 @@ int max_llcs;
> 
> static bool sched_cache_present;
> 
> +unsigned int llc_enabled = 1;
> +DEFINE_STATIC_KEY_FALSE(sched_cache_on);
> +
> +/*
> + * Enable/disable cache aware scheduling according to
> + * user input and the presence of hardware support.
> + */
> +static void _sched_cache_set(bool enable, bool locked)
> +{
> + if (enable) {
> + if (locked)
> + static_branch_enable_cpuslocked(&sched_cache_on);
> + else
> + static_branch_enable(&sched_cache_on);
> + } else {
> + if (locked)
> + static_branch_disable_cpuslocked(&sched_cache_on);
> + else
> + static_branch_disable(&sched_cache_on);
> + }
> +}
> +
> +void sched_cache_set(bool locked)
> +{
> + /* hardware does not support */
> + if (!sched_cache_present) {
> + if (static_branch_likely(&sched_cache_on))
> + _sched_cache_set(false, locked);
> +
> + return;
> + }
> +
> + /* user wants it or not ?*/
> + if (llc_enabled) {
> + if (!static_branch_likely(&sched_cache_on))
> + _sched_cache_set(true, locked);
> +
> + } else {
> + if (static_branch_likely(&sched_cache_on))
> + _sched_cache_set(false, locked);
> + }
> +}
> +
> static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
> {
> unsigned int *new = NULL;
> @@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs)
> * new buffer.
> */
> tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
> - if (!tmp_llc_pref)
> - return -ENOMEM;
> + if (!tmp_llc_pref) {
> + sched_cache_present = false;
> + ret = -ENOMEM;
> +
> + goto out;
> + }
> 
> for_each_present_cpu(i)
> *per_cpu_ptr(tmp_llc_pref, i) = NULL;
> @@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs)
> new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
> if (!new) {
> ret = -ENOMEM;
> + sched_cache_present = false;
> 
> goto release_old;
> }
> @@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs)
> if (!ret)
> max_llcs = new_max_llcs;
> 
> +out:
> + sched_cache_set(true);
> return ret;
> }
> 
> -- 
> 2.32.0


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-23 12:12   ` Yangyu Chen
@ 2025-12-23 16:44     ` Yangyu Chen
  2025-12-24  3:28       ` Yangyu Chen
  0 siblings, 1 reply; 111+ messages in thread
From: Yangyu Chen @ 2025-12-23 16:44 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel



> On 23 Dec 2025, at 20:12, Yangyu Chen <cyy@cyyself.name> wrote:
> 
>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> 
>> From: Chen Yu <yu.c.chen@intel.com>
>> 
>> Introduce a set of debugfs knobs to control the enabling of
>> and parameters for cache-aware load balancing.
>> 
>> (1) llc_enabled
>> llc_enabled acts as the primary switch - users can toggle it to
>> enable or disable cache aware load balancing.
>> 
>> (2) llc_aggr_tolerance
>> With sched_cache enabled, the scheduler uses a process's RSS as a
>> proxy for its LLC footprint to determine if aggregating tasks on the
>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>> size, aggregation is skipped. Some workloads with large RSS but small
>> actual memory footprints may still benefit from aggregation. Since
>> the kernel cannot efficiently track per-task cache usage (resctrl is
>> user-space only), userspace can provide a more accurate hint.
>> 
>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>> users control how strictly RSS limits aggregation. Values range from
>> 0 to 100:
>> 
>> - 0: Cache-aware scheduling is disabled.
>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>> 
> 
> Hi Chen Yu and Tim Chen,
> 
> Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).
> 
> I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, with 32M LLC for each 8-core CCX. I found that I need to tune "llc_aggr_tolerance" to 100, else I can't get cache-aware scheduling to work on Verilated [1] XiangShan [2] running the chacha20 [3] as I mentioned before [4].
> 

In addition, I have investigated why this happens, and I finally
realized it is because that workload shows 35596 kB RssAnon on my
EPYC Milan machine, slightly exceeding the LLC size (32MB). I have
tested it on an EPYC Genoa cloud server with the correct core/cache
hierarchy in the ACPI table, which shows 31700 kB RssAnon and thus
fits in the LLC. I have no idea why my result shows a higher RssAnon,
since both machines run Debian Trixie with the exact same kernel and
the same executable. But it reminds me that we should have a
userspace API for this.
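
For anyone who wants to check this locally, the small sketch below
reproduces the comparison the series makes: it sums RssAnon and
RssShmem from /proc/<pid>/status (mirroring the MM_ANONPAGES +
MM_SHMEMPAGES counters used in exceed_llc_capacity()) and compares
the total to the L3 size of cpu0. It assumes cache index3 is the L3,
which holds on typical x86 systems.

#include <stdio.h>

/* Sum RssAnon and RssShmem (in kB) from /proc/<pid>/status. */
static unsigned long rss_kb(const char *pid)
{
	char path[64], line[256];
	unsigned long anon = 0, shmem = 0, val;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/status", pid);
	f = fopen(path, "r");
	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "RssAnon: %lu kB", &val) == 1)
			anon = val;
		else if (sscanf(line, "RssShmem: %lu kB", &val) == 1)
			shmem = val;
	}
	fclose(f);
	return anon + shmem;
}

int main(int argc, char **argv)
{
	FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index3/size", "r");
	unsigned long llc_kb = 0, rss;
	char unit = 'K';

	if (f) {
		/* sysfs reports e.g. "32768K"; handle an "M" suffix just in case. */
		if (fscanf(f, "%lu%c", &llc_kb, &unit) == 2 && unit == 'M')
			llc_kb *= 1024;
		fclose(f);
	}
	rss = rss_kb(argc > 1 ? argv[1] : "self");
	printf("RssAnon+RssShmem = %lu kB, L3 = %lu kB -> %s the RSS check\n",
	       rss, llc_kb, rss >= llc_kb ? "trips" : "fits under");
	return 0;
}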

Thanks,
Yangyu Chen
 
> But if I set it to 100, I will lose some performance on stream copy benchmarks since the bandwidth is limited per CCX. Thus, I think we should have a new prctl to let userspace software hint the kernel that this task can be bound by latency between cores, and should use this feature no matter the RSS exceed the LLC size.
> 
> I finally have an EPYC Milan at home today. I'm glad to test this patch further. And I'm very willing to have this thus we can have a fast verilator[1] without numactl manually.
> 
> [1] https://github.com/verilator/verilator
> [2] https://github.com/OpenXiangShan/Xiangshan
> [3] https://github.com/cyyself/chacha20-xiangshan
> [4] https://lore.kernel.org/lkml/tencent_6E51A3175F8AE0A7F684A319EE63CC56C806@qq.com/
> 
> Thanks,
> Yangyu Chen
> 
>> For example, with a 32MB L3 cache:
>> 
>> - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>> - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>>   (784GB = (1 + (99 - 1) * 256) * 32MB).
>> 
>> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
>> how strictly the number of active threads is considered when doing
>> cache aware load balance. The number of SMTs is also considered.
>> High SMT counts reduce the aggregation capacity, preventing excessive
>> task aggregation on SMT-heavy systems like Power10/Power11.
>> 
>> For example, with 8 Cores/16 CPUs in a L3:
>> 
>> - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>> - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>>   785 = (1 + (99 - 1) * 8).
>> 
>> (3) llc_epoch_period/llc_epoch_affinity_timeout
>> Besides, llc_epoch_period and llc_epoch_affinity_timeout are also turned
>> into tunable.
>> 
>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
>> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> ---
>> 
>> Notes:
>>   v1->v2: Remove the smt_nr check in fits_llc_capacity().
>>           (Aaron Lu)
>> 
>> include/linux/sched.h   |  4 ++-
>> kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
>> kernel/sched/sched.h    |  5 ++++
>> kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
>> 5 files changed, 178 insertions(+), 10 deletions(-)
>> 
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 466ba8b7398c..95bf080bbbf0 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>> DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>> 
>> #ifdef CONFIG_SCHED_CACHE
>> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
>> +
>> static inline bool sched_cache_enabled(void)
>> {
>> - return false;
>> + return static_branch_unlikely(&sched_cache_on);
>> }
>> #endif
>> 
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 02e16b70a790..cde324672103 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>> .release = single_release,
>> };
>> 
>> +#ifdef CONFIG_SCHED_CACHE
>> +#define SCHED_CACHE_CREATE_CONTROL(name, max)  \
>> +static ssize_t sched_cache_write_##name(struct file *filp,  \
>> + const char __user *ubuf,  \
>> + size_t cnt, loff_t *ppos) \
>> +{  \
>> + char buf[16];  \
>> + unsigned int val;  \
>> + if (cnt > 15)  \
>> + cnt = 15;  \
>> + if (copy_from_user(&buf, ubuf, cnt))  \
>> + return -EFAULT;  \
>> + buf[cnt] = '\0';  \
>> + if (kstrtouint(buf, 10, &val))  \
>> + return -EINVAL;  \
>> + if (val > (max))  \
>> + return -EINVAL;  \
>> + llc_##name = val;  \
>> + if (!strcmp(#name, "enabled"))  \
>> + sched_cache_set(false);  \
>> + *ppos += cnt;  \
>> + return cnt;  \
>> +}  \
>> +static int sched_cache_show_##name(struct seq_file *m, void *v)  \
>> +{  \
>> + seq_printf(m, "%d\n", llc_##name);  \
>> + return 0;  \
>> +}  \
>> +static int sched_cache_open_##name(struct inode *inode,  \
>> +   struct file *filp)  \
>> +{  \
>> + return single_open(filp, sched_cache_show_##name, NULL);  \
>> +}  \
>> +static const struct file_operations sched_cache_fops_##name = {  \
>> + .open = sched_cache_open_##name,  \
>> + .write = sched_cache_write_##name,  \
>> + .read = seq_read,  \
>> + .llseek = seq_lseek,  \
>> + .release = single_release,  \
>> +}
>> +
>> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
>> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
>> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
>> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
>> +#endif /* SCHED_CACHE */
>> +
>> static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
>>  size_t cnt, loff_t *ppos)
>> {
>> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
>> debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
>> #endif /* CONFIG_NUMA_BALANCING */
>> 
>> +#ifdef CONFIG_SCHED_CACHE
>> + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
>> +    &sched_cache_fops_overload_pct);
>> + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
>> +    &sched_cache_fops_imb_pct);
>> + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
>> +    &sched_cache_fops_aggr_tolerance);
>> + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
>> +    &sched_cache_fops_enabled);
>> + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
>> +   &llc_epoch_period);
>> + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
>> +   &llc_epoch_affinity_timeout);
>> +#endif
>> +
>> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>> 
>> debugfs_fair_server_init();
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 424ec601cfdf..a2e2d6742481 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>> 
>> __read_mostly unsigned int llc_overload_pct       = 50;
>> __read_mostly unsigned int llc_imb_pct            = 20;
>> +__read_mostly unsigned int llc_aggr_tolerance     = 1;
>> +__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
>> +__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
>> 
>> static int llc_id(int cpu)
>> {
>> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
>> return llc;
>> }
>> 
>> +static inline int get_sched_cache_scale(int mul)
>> +{
>> + if (!llc_aggr_tolerance)
>> + return 0;
>> +
>> + if (llc_aggr_tolerance == 100)
>> + return INT_MAX;
>> +
>> + return (1 + (llc_aggr_tolerance - 1) * mul);
>> +}
>> +
>> static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>> {
>> + unsigned int llc, scale;
>> struct cacheinfo *ci;
>> unsigned long rss;
>> - unsigned int llc;
>> 
>> /*
>> * get_cpu_cacheinfo_level() can not be used
>> @@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>> rss = get_mm_counter(mm, MM_ANONPAGES) +
>> get_mm_counter(mm, MM_SHMEMPAGES);
>> 
>> - return (llc <= (rss * PAGE_SIZE));
>> + /*
>> + * Scale the LLC size by 256*llc_aggr_tolerance
>> + * and compare it to the task's RSS size.
>> + *
>> + * Suppose the L3 size is 32MB. If the
>> + * llc_aggr_tolerance is 1:
>> + * When the RSS is larger than 32MB, the process
>> + * is regarded as exceeding the LLC capacity. If
>> + * the llc_aggr_tolerance is 99:
>> + * When the RSS is larger than 784GB, the process
>> + * is regarded as exceeding the LLC capacity because:
>> + * 784GB = (1 + (99 - 1) * 256) * 32MB
>> + */
>> + scale = get_sched_cache_scale(256);
>> + if (scale == INT_MAX)
>> + return false;
>> +
>> + return ((llc * scale) <= (rss * PAGE_SIZE));
>> }
>> 
>> static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>> {
>> - int smt_nr = 1;
>> + int smt_nr = 1, scale;
>> 
>> #ifdef CONFIG_SCHED_SMT
>> if (sched_smt_active())
>> smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>> #endif
>> + /*
>> + * Scale the Core number in a LLC by llc_aggr_tolerance
>> + * and compare it to the task's active threads.
>> + *
>> + * Suppose the number of Cores in LLC is 8.
>> + * Every core has 2 SMTs.
>> + * If the llc_aggr_tolerance is 1: When the
>> + * nr_running is larger than 8, the process
>> + * is regarded as exceeding the LLC capacity.
>> + * If the llc_aggr_tolerance is 99:
>> + * When the nr_running is larger than 785,
>> + * the process is regarded as exceeding
>> + * the LLC capacity:
>> + * 785 = 1 + (99 - 1) * 8
>> + */
>> + scale = get_sched_cache_scale(1);
>> + if (scale == INT_MAX)
>> + return false;
>> 
>> - return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
>> + return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
>> }
>> 
>> static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
>> @@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
>> long delta = now - rq->cpu_epoch_next;
>> 
>> if (delta > 0) {
>> - n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
>> + n = (delta + llc_epoch_period - 1) / llc_epoch_period;
>> rq->cpu_epoch += n;
>> - rq->cpu_epoch_next += n * EPOCH_PERIOD;
>> + rq->cpu_epoch_next += n * llc_epoch_period;
>> __shr_u64(&rq->cpu_runtime, n);
>> }
>> 
>> @@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>> * has only 1 thread, or has too many active threads, invalidate
>> * its preferred state.
>> */
>> - if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
>> + if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
>>   get_nr_threads(p) <= 1 ||
>>   exceed_llc_nr(mm, cpu_of(rq)) ||
>>   exceed_llc_capacity(mm, cpu_of(rq))) {
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 40798a06e058..15d126bd3728 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
>> #ifdef CONFIG_SCHED_CACHE
>> extern unsigned int llc_overload_pct;
>> extern unsigned int llc_imb_pct;
>> +extern unsigned int llc_aggr_tolerance;
>> +extern unsigned int llc_epoch_period;
>> +extern unsigned int llc_epoch_affinity_timeout;
>> +extern unsigned int llc_enabled;
>> +void sched_cache_set(bool locked);
>> #endif
>> 
>> #ifdef CONFIG_SCHED_HRTICK
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 9799e3a9a609..818599ddaaef 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -26,6 +26,49 @@ int max_llcs;
>> 
>> static bool sched_cache_present;
>> 
>> +unsigned int llc_enabled = 1;
>> +DEFINE_STATIC_KEY_FALSE(sched_cache_on);
>> +
>> +/*
>> + * Enable/disable cache aware scheduling according to
>> + * user input and the presence of hardware support.
>> + */
>> +static void _sched_cache_set(bool enable, bool locked)
>> +{
>> + if (enable) {
>> + if (locked)
>> + static_branch_enable_cpuslocked(&sched_cache_on);
>> + else
>> + static_branch_enable(&sched_cache_on);
>> + } else {
>> + if (locked)
>> + static_branch_disable_cpuslocked(&sched_cache_on);
>> + else
>> + static_branch_disable(&sched_cache_on);
>> + }
>> +}
>> +
>> +void sched_cache_set(bool locked)
>> +{
>> + /* hardware does not support */
>> + if (!sched_cache_present) {
>> + if (static_branch_likely(&sched_cache_on))
>> + _sched_cache_set(false, locked);
>> +
>> + return;
>> + }
>> +
>> + /* user wants it or not ?*/
>> + if (llc_enabled) {
>> + if (!static_branch_likely(&sched_cache_on))
>> + _sched_cache_set(true, locked);
>> +
>> + } else {
>> + if (static_branch_likely(&sched_cache_on))
>> + _sched_cache_set(false, locked);
>> + }
>> +}
>> +
>> static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
>> {
>> unsigned int *new = NULL;
>> @@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs)
>> * new buffer.
>> */
>> tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
>> - if (!tmp_llc_pref)
>> - return -ENOMEM;
>> + if (!tmp_llc_pref) {
>> + sched_cache_present = false;
>> + ret = -ENOMEM;
>> +
>> + goto out;
>> + }
>> 
>> for_each_present_cpu(i)
>> *per_cpu_ptr(tmp_llc_pref, i) = NULL;
>> @@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs)
>> new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
>> if (!new) {
>> ret = -ENOMEM;
>> + sched_cache_present = false;
>> 
>> goto release_old;
>> }
>> @@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs)
>> if (!ret)
>> max_llcs = new_max_llcs;
>> 
>> +out:
>> + sched_cache_set(true);
>> return ret;
>> }
>> 
>> -- 
>> 2.32.0
> 



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-23 16:44     ` Yangyu Chen
@ 2025-12-24  3:28       ` Yangyu Chen
  2025-12-24  7:51         ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: Yangyu Chen @ 2025-12-24  3:28 UTC (permalink / raw)
  To: Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Chen Yu, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel



> On 24 Dec 2025, at 00:44, Yangyu Chen <cyy@cyyself.name> wrote:
> 
>> On 23 Dec 2025, at 20:12, Yangyu Chen <cyy@cyyself.name> wrote:
>> 
>>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>> 
>>> From: Chen Yu <yu.c.chen@intel.com>
>>> 
>>> Introduce a set of debugfs knobs to control the enabling of
>>> and parameters for cache-aware load balancing.
>>> 
>>> (1) llc_enabled
>>> llc_enabled acts as the primary switch - users can toggle it to
>>> enable or disable cache aware load balancing.
>>> 
>>> (2) llc_aggr_tolerance
>>> With sched_cache enabled, the scheduler uses a process's RSS as a
>>> proxy for its LLC footprint to determine if aggregating tasks on the
>>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>>> size, aggregation is skipped. Some workloads with large RSS but small
>>> actual memory footprints may still benefit from aggregation. Since
>>> the kernel cannot efficiently track per-task cache usage (resctrl is
>>> user-space only), userspace can provide a more accurate hint.
>>> 
>>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>>> users control how strictly RSS limits aggregation. Values range from
>>> 0 to 100:
>>> 
>>> - 0: Cache-aware scheduling is disabled.
>>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>> 
>> 
>> Hi Chen Yu and Tim Chen,
>> 
>> Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).
>> 
>> I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, which has a 32M LLC per 8-core CCX. I found that I need to set "llc_aggr_tolerance" to 100, otherwise I can't get cache-aware scheduling to kick in for the Verilated [1] XiangShan [2] running the chacha20 workload [3], as I mentioned before [4].
>> 
> 
> In addition, I have investigated why this happens, and finally
> realized it is because the workload shows 35596 kB RssAnon on my
> EPYC Milan machine, slightly exceeding the LLC size (32 MB). I have
> tested it on an EPYC Genoa cloud server with the correct core/cache
> hierarchy in its ACPI tables, where it shows 31700 kB RssAnon and
> thus fits in the LLC. I have no idea why my result shows a higher
> RssAnon, since both machines run Debian Trixie with exactly the
> same kernel and the same executable. But it reminds me that we
> should have a userspace API for this.
> 
> Thanks,
> Yangyu Chen
> 

In addition, while profiling the Verilator workload, I found that
scheduling its threads onto SMT siblings results in poor performance.
Thus, I think we should separate the RSS-size control from the SMT
scaling.

Note that the RSS is not the actual memory footprint. It would be
better if we could measure the l2_miss or l3_miss events to estimate
the L3 hit rate. Just a thought for future work.
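
In the meantime, that can already be approximated from userspace with
the generic hardware cache events, which usually map to the LLC. A
rough, untested sketch using perf_event_open() (whether these events
are actually backed by L3 counters depends on the CPU/PMU):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_counter(uint64_t config)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = config;           /* LLC references or misses */
        attr.exclude_kernel = 1;

        /* pid = 0, cpu = -1: count this process on any CPU */
        return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
        int ref_fd = open_counter(PERF_COUNT_HW_CACHE_REFERENCES);
        int miss_fd = open_counter(PERF_COUNT_HW_CACHE_MISSES);
        uint64_t ref = 0, miss = 0;

        sleep(10);      /* run the workload of interest here instead */

        read(ref_fd, &ref, sizeof(ref));
        read(miss_fd, &miss, sizeof(miss));
        if (ref)
                printf("LLC hit rate: %.2f%%\n",
                       100.0 * (double)(ref - miss) / (double)ref);
        return 0;
}

Of course this only gives a per-process aggregate, not the per-LLC
view the scheduler would need, so it is only a hint for tuning the
knobs by hand.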

I'm willing to provide a patch for such a prctl, but I'm busy these
days; I should have time to get to it in about a week.
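
To illustrate what I have in mind, the userspace side would be no
more than something like this (untested; PR_SET_LLC_AGGR_TOLERANCE
and its value are placeholders, no such prctl exists yet):

#include <stdio.h>
#include <sys/prctl.h>

/* Placeholder: a real patch would claim the next free prctl number. */
#define PR_SET_LLC_AGGR_TOLERANCE 0x4c4c4301

int main(void)
{
        /*
         * 100: aggregate on the preferred LLC regardless of RSS,
         * mirroring the debugfs llc_aggr_tolerance semantics.
         */
        if (prctl(PR_SET_LLC_AGGR_TOLERANCE, 100, 0, 0, 0))
                perror("prctl");        /* fails on current kernels */
        return 0;
}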

Thanks,
Yangyu Chen

>> But if I set it to 100, I lose some performance on stream copy benchmarks, since memory bandwidth is limited per CCX. Thus, I think we should have a new prctl to let userspace hint to the kernel that a task is bound by inter-core latency and should use this feature even if its RSS exceeds the LLC size.
>> 
>> I finally have an EPYC Milan at home today, and I'm glad to test this patch further. I would very much like to have this so we can get a fast Verilator [1] without having to use numactl manually.
>> 
>> [1] https://github.com/verilator/verilator
>> [2] https://github.com/OpenXiangShan/Xiangshan
>> [3] https://github.com/cyyself/chacha20-xiangshan
>> [4] https://lore.kernel.org/lkml/tencent_6E51A3175F8AE0A7F684A319EE63CC56C806@qq.com/
>> 
>> Thanks,
>> Yangyu Chen
>> 
>>> For example, with a 32MB L3 cache:
>>> 
>>> - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>>> - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>>>  (784GB = (1 + (99 - 1) * 256) * 32MB).
>>> 
>>> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
>>> how strictly the number of active threads is considered when doing
>>> cache-aware load balancing. The number of SMTs is also considered.
>>> High SMT counts reduce the aggregation capacity, preventing excessive
>>> task aggregation on SMT-heavy systems like Power10/Power11.
>>> 
>>> For example, with 8 Cores/16 CPUs in a L3:
>>> 
>>> - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>>> - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>>>  785 = (1 + (99 - 1) * 8).
>>> 
>>> (3) llc_epoch_period/llc_epoch_affinity_timeout
>>> Besides, llc_epoch_period and llc_epoch_affinity_timeout are also made
>>> tunable.
>>> 
>>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
>>> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>> Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
>>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> ---
>>> 
>>> Notes:
>>>  v1->v2: Remove the smt_nr check in fits_llc_capacity().
>>>          (Aaron Lu)
>>> 
>>> include/linux/sched.h   |  4 ++-
>>> kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
>>> kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
>>> kernel/sched/sched.h    |  5 ++++
>>> kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
>>> 5 files changed, 178 insertions(+), 10 deletions(-)
>>> 
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 466ba8b7398c..95bf080bbbf0 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>>> DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>>> 
>>> #ifdef CONFIG_SCHED_CACHE
>>> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
>>> +
>>> static inline bool sched_cache_enabled(void)
>>> {
>>> - return false;
>>> + return static_branch_unlikely(&sched_cache_on);
>>> }
>>> #endif
>>> 
>>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>>> index 02e16b70a790..cde324672103 100644
>>> --- a/kernel/sched/debug.c
>>> +++ b/kernel/sched/debug.c
>>> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>>> .release = single_release,
>>> };
>>> 
>>> +#ifdef CONFIG_SCHED_CACHE
>>> +#define SCHED_CACHE_CREATE_CONTROL(name, max)  \
>>> +static ssize_t sched_cache_write_##name(struct file *filp,  \
>>> + const char __user *ubuf,  \
>>> + size_t cnt, loff_t *ppos) \
>>> +{  \
>>> + char buf[16];  \
>>> + unsigned int val;  \
>>> + if (cnt > 15)  \
>>> + cnt = 15;  \
>>> + if (copy_from_user(&buf, ubuf, cnt))  \
>>> + return -EFAULT;  \
>>> + buf[cnt] = '\0';  \
>>> + if (kstrtouint(buf, 10, &val))  \
>>> + return -EINVAL;  \
>>> + if (val > (max))  \
>>> + return -EINVAL;  \
>>> + llc_##name = val;  \
>>> + if (!strcmp(#name, "enabled"))  \
>>> + sched_cache_set(false);  \
>>> + *ppos += cnt;  \
>>> + return cnt;  \
>>> +}  \
>>> +static int sched_cache_show_##name(struct seq_file *m, void *v)  \
>>> +{  \
>>> + seq_printf(m, "%d\n", llc_##name);  \
>>> + return 0;  \
>>> +}  \
>>> +static int sched_cache_open_##name(struct inode *inode,  \
>>> +   struct file *filp)  \
>>> +{  \
>>> + return single_open(filp, sched_cache_show_##name, NULL);  \
>>> +}  \
>>> +static const struct file_operations sched_cache_fops_##name = {  \
>>> + .open = sched_cache_open_##name,  \
>>> + .write = sched_cache_write_##name,  \
>>> + .read = seq_read,  \
>>> + .llseek = seq_lseek,  \
>>> + .release = single_release,  \
>>> +}
>>> +
>>> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
>>> +#endif /* SCHED_CACHE */
>>> +
>>> static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
>>> size_t cnt, loff_t *ppos)
>>> {
>>> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
>>> debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
>>> #endif /* CONFIG_NUMA_BALANCING */
>>> 
>>> +#ifdef CONFIG_SCHED_CACHE
>>> + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_overload_pct);
>>> + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_imb_pct);
>>> + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_aggr_tolerance);
>>> + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_enabled);
>>> + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
>>> +   &llc_epoch_period);
>>> + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
>>> +   &llc_epoch_affinity_timeout);
>>> +#endif
>>> +
>>> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>>> 
>>> debugfs_fair_server_init();
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 424ec601cfdf..a2e2d6742481 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>>> 
>>> __read_mostly unsigned int llc_overload_pct       = 50;
>>> __read_mostly unsigned int llc_imb_pct            = 20;
>>> +__read_mostly unsigned int llc_aggr_tolerance     = 1;
>>> +__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
>>> +__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
>>> 
>>> static int llc_id(int cpu)
>>> {
>>> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
>>> return llc;
>>> }
>>> 
>>> +static inline int get_sched_cache_scale(int mul)
>>> +{
>>> + if (!llc_aggr_tolerance)
>>> + return 0;
>>> +
>>> + if (llc_aggr_tolerance == 100)
>>> + return INT_MAX;
>>> +
>>> + return (1 + (llc_aggr_tolerance - 1) * mul);
>>> +}
>>> +
>>> static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> {
>>> + unsigned int llc, scale;
>>> struct cacheinfo *ci;
>>> unsigned long rss;
>>> - unsigned int llc;
>>> 
>>> /*
>>> * get_cpu_cacheinfo_level() can not be used
>>> @@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> rss = get_mm_counter(mm, MM_ANONPAGES) +
>>> get_mm_counter(mm, MM_SHMEMPAGES);
>>> 
>>> - return (llc <= (rss * PAGE_SIZE));
>>> + /*
>>> + * Scale the LLC size by 256*llc_aggr_tolerance
>>> + * and compare it to the task's RSS size.
>>> + *
>>> + * Suppose the L3 size is 32MB. If the
>>> + * llc_aggr_tolerance is 1:
>>> + * When the RSS is larger than 32MB, the process
>>> + * is regarded as exceeding the LLC capacity. If
>>> + * the llc_aggr_tolerance is 99:
>>> + * When the RSS is larger than 784GB, the process
>>> + * is regarded as exceeding the LLC capacity because:
>>> + * 784GB = (1 + (99 - 1) * 256) * 32MB
>>> + */
>>> + scale = get_sched_cache_scale(256);
>>> + if (scale == INT_MAX)
>>> + return false;
>>> +
>>> + return ((llc * scale) <= (rss * PAGE_SIZE));
>>> }
>>> 
>>> static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>>> {
>>> - int smt_nr = 1;
>>> + int smt_nr = 1, scale;
>>> 
>>> #ifdef CONFIG_SCHED_SMT
>>> if (sched_smt_active())
>>> smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>>> #endif
>>> + /*
>>> + * Scale the Core number in a LLC by llc_aggr_tolerance
>>> + * and compare it to the task's active threads.
>>> + *
>>> + * Suppose the number of Cores in LLC is 8.
>>> + * Every core has 2 SMTs.
>>> + * If the llc_aggr_tolerance is 1: When the
>>> + * nr_running is larger than 8, the process
>>> + * is regarded as exceeding the LLC capacity.
>>> + * If the llc_aggr_tolerance is 99:
>>> + * When the nr_running is larger than 785,
>>> + * the process is regarded as exceeding
>>> + * the LLC capacity:
>>> + * 785 = 1 + (99 - 1) * 8
>>> + */
>>> + scale = get_sched_cache_scale(1);
>>> + if (scale == INT_MAX)
>>> + return false;
>>> 
>>> - return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
>>> + return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
>>> }
>>> 
>>> static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
>>> @@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
>>> long delta = now - rq->cpu_epoch_next;
>>> 
>>> if (delta > 0) {
>>> - n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
>>> + n = (delta + llc_epoch_period - 1) / llc_epoch_period;
>>> rq->cpu_epoch += n;
>>> - rq->cpu_epoch_next += n * EPOCH_PERIOD;
>>> + rq->cpu_epoch_next += n * llc_epoch_period;
>>> __shr_u64(&rq->cpu_runtime, n);
>>> }
>>> 
>>> @@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>>> * has only 1 thread, or has too many active threads, invalidate
>>> * its preferred state.
>>> */
>>> - if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
>>> + if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
>>>  get_nr_threads(p) <= 1 ||
>>>  exceed_llc_nr(mm, cpu_of(rq)) ||
>>>  exceed_llc_capacity(mm, cpu_of(rq))) {
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 40798a06e058..15d126bd3728 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
>>> #ifdef CONFIG_SCHED_CACHE
>>> extern unsigned int llc_overload_pct;
>>> extern unsigned int llc_imb_pct;
>>> +extern unsigned int llc_aggr_tolerance;
>>> +extern unsigned int llc_epoch_period;
>>> +extern unsigned int llc_epoch_affinity_timeout;
>>> +extern unsigned int llc_enabled;
>>> +void sched_cache_set(bool locked);
>>> #endif
>>> 
>>> #ifdef CONFIG_SCHED_HRTICK
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 9799e3a9a609..818599ddaaef 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -26,6 +26,49 @@ int max_llcs;
>>> 
>>> static bool sched_cache_present;
>>> 
>>> +unsigned int llc_enabled = 1;
>>> +DEFINE_STATIC_KEY_FALSE(sched_cache_on);
>>> +
>>> +/*
>>> + * Enable/disable cache aware scheduling according to
>>> + * user input and the presence of hardware support.
>>> + */
>>> +static void _sched_cache_set(bool enable, bool locked)
>>> +{
>>> + if (enable) {
>>> + if (locked)
>>> + static_branch_enable_cpuslocked(&sched_cache_on);
>>> + else
>>> + static_branch_enable(&sched_cache_on);
>>> + } else {
>>> + if (locked)
>>> + static_branch_disable_cpuslocked(&sched_cache_on);
>>> + else
>>> + static_branch_disable(&sched_cache_on);
>>> + }
>>> +}
>>> +
>>> +void sched_cache_set(bool locked)
>>> +{
>>> + /* hardware does not support */
>>> + if (!sched_cache_present) {
>>> + if (static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(false, locked);
>>> +
>>> + return;
>>> + }
>>> +
>>> + /* does the user want it? */
>>> + if (llc_enabled) {
>>> + if (!static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(true, locked);
>>> +
>>> + } else {
>>> + if (static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(false, locked);
>>> + }
>>> +}
>>> +
>>> static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
>>> {
>>> unsigned int *new = NULL;
>>> @@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> * new buffer.
>>> */
>>> tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
>>> - if (!tmp_llc_pref)
>>> - return -ENOMEM;
>>> + if (!tmp_llc_pref) {
>>> + sched_cache_present = false;
>>> + ret = -ENOMEM;
>>> +
>>> + goto out;
>>> + }
>>> 
>>> for_each_present_cpu(i)
>>> *per_cpu_ptr(tmp_llc_pref, i) = NULL;
>>> @@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
>>> if (!new) {
>>> ret = -ENOMEM;
>>> + sched_cache_present = false;
>>> 
>>> goto release_old;
>>> }
>>> @@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> if (!ret)
>>> max_llcs = new_max_llcs;
>>> 
>>> +out:
>>> + sched_cache_set(true);
>>> return ret;
>>> }
>>> 
>>> -- 
>>> 2.32.0
>> 
>>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>> 
>>> From: Chen Yu <yu.c.chen@intel.com>
>>> 
>>> Introduce a set of debugfs knobs to control the enabling of
>>> and parameters for cache-aware load balancing.
>>> 
>>> (1) llc_enabled
>>> llc_enabled acts as the primary switch - users can toggle it to
>>> enable or disable cache aware load balancing.
>>> 
>>> (2) llc_aggr_tolerance
>>> With sched_cache enabled, the scheduler uses a process's RSS as a
>>> proxy for its LLC footprint to determine if aggregating tasks on the
>>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>>> size, aggregation is skipped. Some workloads with large RSS but small
>>> actual memory footprints may still benefit from aggregation. Since
>>> the kernel cannot efficiently track per-task cache usage (resctrl is
>>> user-space only), userspace can provide a more accurate hint.
>>> 
>>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>>> users control how strictly RSS limits aggregation. Values range from
>>> 0 to 100:
>>> 
>>> - 0: Cache-aware scheduling is disabled.
>>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>> 
>> 
>> Hi Chen Yu and Tim Chen,
>> 
>> Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).
>> 
>> I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, with 32M LLC for each 8-core CCX. I found that I need to tune "llc_aggr_tolerance" to 100, else I can't get cache-aware scheduling to work on Verilated [1] XiangShan [2] running the chacha20 [3] as I mentioned before [4].
>> 
>> But if I set it to 100, I will lose some performance on stream copy benchmarks since the bandwidth is limited per CCX. Thus, I think we should have a new prctl to let userspace software hint the kernel that this task can be bound by latency between cores, and should use this feature no matter the RSS exceed the LLC size.
>> 
>> I finally have an EPYC Milan at home today. I'm glad to test this patch further. And I'm very willing to have this thus we can have a fast verilator[1] without numactl manually.
>> 
>> [1] https://github.com/verilator/verilator
>> [2] https://github.com/OpenXiangShan/Xiangshan
>> [3] https://github.com/cyyself/chacha20-xiangshan
>> [4] https://lore.kernel.org/lkml/tencent_6E51A3175F8AE0A7F684A319EE63CC56C806@qq.com/
>> 
>> Thanks,
>> Yangyu Chen
>> 
>>> For exaImple, with a 32MB L3 cache:
>>> 
>>> - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>>> - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>>>  (784GB = (1 + (99 - 1) * 256) * 32MB).
>>> 
>>> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
>>> how strictly the number of active threads is considered when doing
>>> cache aware load balance. The number of SMTs is also considered.
>>> High SMT counts reduce the aggregation capacity, preventing excessive
>>> task aggregation on SMT-heavy systems like Power10/Power11.
>>> 
>>> For example, with 8 Cores/16 CPUs in a L3:
>>> 
>>> - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>>> - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>>>  785 = (1 + (99 - 1) * 8).
>>> 
>>> (3) llc_epoch_period/llc_epoch_affinity_timeout
>>> Besides, llc_epoch_period and llc_epoch_affinity_timeout are also turned
>>> into tunable.
>>> 
>>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
>>> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>> Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
>>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> ---
>>> 
>>> Notes:
>>>  v1->v2: Remove the smt_nr check in fits_llc_capacity().
>>>          (Aaron Lu)
>>> 
>>> include/linux/sched.h   |  4 ++-
>>> kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
>>> kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
>>> kernel/sched/sched.h    |  5 ++++
>>> kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
>>> 5 files changed, 178 insertions(+), 10 deletions(-)
>>> 
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 466ba8b7398c..95bf080bbbf0 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>>> DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>>> 
>>> #ifdef CONFIG_SCHED_CACHE
>>> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
>>> +
>>> static inline bool sched_cache_enabled(void)
>>> {
>>> - return false;
>>> + return static_branch_unlikely(&sched_cache_on);
>>> }
>>> #endif
>>> 
>>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>>> index 02e16b70a790..cde324672103 100644
>>> --- a/kernel/sched/debug.c
>>> +++ b/kernel/sched/debug.c
>>> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>>> .release = single_release,
>>> };
>>> 
>>> +#ifdef CONFIG_SCHED_CACHE
>>> +#define SCHED_CACHE_CREATE_CONTROL(name, max)  \
>>> +static ssize_t sched_cache_write_##name(struct file *filp,  \
>>> + const char __user *ubuf,  \
>>> + size_t cnt, loff_t *ppos) \
>>> +{  \
>>> + char buf[16];  \
>>> + unsigned int val;  \
>>> + if (cnt > 15)  \
>>> + cnt = 15;  \
>>> + if (copy_from_user(&buf, ubuf, cnt))  \
>>> + return -EFAULT;  \
>>> + buf[cnt] = '\0';  \
>>> + if (kstrtouint(buf, 10, &val))  \
>>> + return -EINVAL;  \
>>> + if (val > (max))  \
>>> + return -EINVAL;  \
>>> + llc_##name = val;  \
>>> + if (!strcmp(#name, "enabled"))  \
>>> + sched_cache_set(false);  \
>>> + *ppos += cnt;  \
>>> + return cnt;  \
>>> +}  \
>>> +static int sched_cache_show_##name(struct seq_file *m, void *v)  \
>>> +{  \
>>> + seq_printf(m, "%d\n", llc_##name);  \
>>> + return 0;  \
>>> +}  \
>>> +static int sched_cache_open_##name(struct inode *inode,  \
>>> +   struct file *filp)  \
>>> +{  \
>>> + return single_open(filp, sched_cache_show_##name, NULL);  \
>>> +}  \
>>> +static const struct file_operations sched_cache_fops_##name = {  \
>>> + .open = sched_cache_open_##name,  \
>>> + .write = sched_cache_write_##name,  \
>>> + .read = seq_read,  \
>>> + .llseek = seq_lseek,  \
>>> + .release = single_release,  \
>>> +}
>>> +
>>> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
>>> +#endif /* SCHED_CACHE */
>>> +
>>> static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
>>> size_t cnt, loff_t *ppos)
>>> {
>>> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
>>> debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
>>> #endif /* CONFIG_NUMA_BALANCING */
>>> 
>>> +#ifdef CONFIG_SCHED_CACHE
>>> + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_overload_pct);
>>> + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_imb_pct);
>>> + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_aggr_tolerance);
>>> + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_enabled);
>>> + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
>>> +   &llc_epoch_period);
>>> + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
>>> +   &llc_epoch_affinity_timeout);
>>> +#endif
>>> +
>>> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>>> 
>>> debugfs_fair_server_init();
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 424ec601cfdf..a2e2d6742481 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>>> 
>>> __read_mostly unsigned int llc_overload_pct       = 50;
>>> __read_mostly unsigned int llc_imb_pct            = 20;
>>> +__read_mostly unsigned int llc_aggr_tolerance     = 1;
>>> +__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
>>> +__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
>>> 
>>> static int llc_id(int cpu)
>>> {
>>> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
>>> return llc;
>>> }
>>> 
>>> +static inline int get_sched_cache_scale(int mul)
>>> +{
>>> + if (!llc_aggr_tolerance)
>>> + return 0;
>>> +
>>> + if (llc_aggr_tolerance == 100)
>>> + return INT_MAX;
>>> +
>>> + return (1 + (llc_aggr_tolerance - 1) * mul);
>>> +}
>>> +
>>> static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> {
>>> + unsigned int llc, scale;
>>> struct cacheinfo *ci;
>>> unsigned long rss;
>>> - unsigned int llc;
>>> 
>>> /*
>>> * get_cpu_cacheinfo_level() can not be used
>>> @@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> rss = get_mm_counter(mm, MM_ANONPAGES) +
>>> get_mm_counter(mm, MM_SHMEMPAGES);
>>> 
>>> - return (llc <= (rss * PAGE_SIZE));
>>> + /*
>>> + * Scale the LLC size by 256*llc_aggr_tolerance
>>> + * and compare it to the task's RSS size.
>>> + *
>>> + * Suppose the L3 size is 32MB. If the
>>> + * llc_aggr_tolerance is 1:
>>> + * When the RSS is larger than 32MB, the process
>>> + * is regarded as exceeding the LLC capacity. If
>>> + * the llc_aggr_tolerance is 99:
>>> + * When the RSS is larger than 784GB, the process
>>> + * is regarded as exceeding the LLC capacity because:
>>> + * 784GB = (1 + (99 - 1) * 256) * 32MB
>>> + */
>>> + scale = get_sched_cache_scale(256);
>>> + if (scale == INT_MAX)
>>> + return false;
>>> +
>>> + return ((llc * scale) <= (rss * PAGE_SIZE));
>>> }
>>> 
>>> static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>>> {
>>> - int smt_nr = 1;
>>> + int smt_nr = 1, scale;
>>> 
>>> #ifdef CONFIG_SCHED_SMT
>>> if (sched_smt_active())
>>> smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>>> #endif
>>> + /*
>>> + * Scale the Core number in a LLC by llc_aggr_tolerance
>>> + * and compare it to the task's active threads.
>>> + *
>>> + * Suppose the number of Cores in LLC is 8.
>>> + * Every core has 2 SMTs.
>>> + * If the llc_aggr_tolerance is 1: When the
>>> + * nr_running is larger than 8, the process
>>> + * is regarded as exceeding the LLC capacity.
>>> + * If the llc_aggr_tolerance is 99:
>>> + * When the nr_running is larger than 785,
>>> + * the process is regarded as exceeding
>>> + * the LLC capacity:
>>> + * 785 = 1 + (99 - 1) * 8
>>> + */
>>> + scale = get_sched_cache_scale(1);
>>> + if (scale == INT_MAX)
>>> + return false;
>>> 
>>> - return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
>>> + return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
>>> }
>>> 
>>> static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
>>> @@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
>>> long delta = now - rq->cpu_epoch_next;
>>> 
>>> if (delta > 0) {
>>> - n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
>>> + n = (delta + llc_epoch_period - 1) / llc_epoch_period;
>>> rq->cpu_epoch += n;
>>> - rq->cpu_epoch_next += n * EPOCH_PERIOD;
>>> + rq->cpu_epoch_next += n * llc_epoch_period;
>>> __shr_u64(&rq->cpu_runtime, n);
>>> }
>>> 
>>> @@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>>> * has only 1 thread, or has too many active threads, invalidate
>>> * its preferred state.
>>> */
>>> - if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
>>> + if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
>>>  get_nr_threads(p) <= 1 ||
>>>  exceed_llc_nr(mm, cpu_of(rq)) ||
>>>  exceed_llc_capacity(mm, cpu_of(rq))) {
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 40798a06e058..15d126bd3728 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
>>> #ifdef CONFIG_SCHED_CACHE
>>> extern unsigned int llc_overload_pct;
>>> extern unsigned int llc_imb_pct;
>>> +extern unsigned int llc_aggr_tolerance;
>>> +extern unsigned int llc_epoch_period;
>>> +extern unsigned int llc_epoch_affinity_timeout;
>>> +extern unsigned int llc_enabled;
>>> +void sched_cache_set(bool locked);
>>> #endif
>>> 
>>> #ifdef CONFIG_SCHED_HRTICK
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 9799e3a9a609..818599ddaaef 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -26,6 +26,49 @@ int max_llcs;
>>> 
>>> static bool sched_cache_present;
>>> 
>>> +unsigned int llc_enabled = 1;
>>> +DEFINE_STATIC_KEY_FALSE(sched_cache_on);
>>> +
>>> +/*
>>> + * Enable/disable cache aware scheduling according to
>>> + * user input and the presence of hardware support.
>>> + */
>>> +static void _sched_cache_set(bool enable, bool locked)
>>> +{
>>> + if (enable) {
>>> + if (locked)
>>> + static_branch_enable_cpuslocked(&sched_cache_on);
>>> + else
>>> + static_branch_enable(&sched_cache_on);
>>> + } else {
>>> + if (locked)
>>> + static_branch_disable_cpuslocked(&sched_cache_on);
>>> + else
>>> + static_branch_disable(&sched_cache_on);
>>> + }
>>> +}
>>> +
>>> +void sched_cache_set(bool locked)
>>> +{
>>> + /* hardware does not support */
>>> + if (!sched_cache_present) {
>>> + if (static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(false, locked);
>>> +
>>> + return;
>>> + }
>>> +
>>> + /* user wants it or not ?*/
>>> + if (llc_enabled) {
>>> + if (!static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(true, locked);
>>> +
>>> + } else {
>>> + if (static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(false, locked);
>>> + }
>>> +}
>>> +
>>> static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
>>> {
>>> unsigned int *new = NULL;
>>> @@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> * new buffer.
>>> */
>>> tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
>>> - if (!tmp_llc_pref)
>>> - return -ENOMEM;
>>> + if (!tmp_llc_pref) {
>>> + sched_cache_present = false;
>>> + ret = -ENOMEM;
>>> +
>>> + goto out;
>>> + }
>>> 
>>> for_each_present_cpu(i)
>>> *per_cpu_ptr(tmp_llc_pref, i) = NULL;
>>> @@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
>>> if (!new) {
>>> ret = -ENOMEM;
>>> + sched_cache_present = false;
>>> 
>>> goto release_old;
>>> }
>>> @@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> if (!ret)
>>> max_llcs = new_max_llcs;
>>> 
>>> +out:
>>> + sched_cache_set(true);
>>> return ret;
>>> }
>>> 
>>> -- 
>>> 2.32.0
> 
> 
>> On 23 Dec 2025, at 20:12, Yangyu Chen <cyy@cyyself.name> wrote:
>> 
>>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>> 
>>> From: Chen Yu <yu.c.chen@intel.com>
>>> 
>>> Introduce a set of debugfs knobs to control the enabling of
>>> and parameters for cache-aware load balancing.
>>> 
>>> (1) llc_enabled
>>> llc_enabled acts as the primary switch - users can toggle it to
>>> enable or disable cache aware load balancing.
>>> 
>>> (2) llc_aggr_tolerance
>>> With sched_cache enabled, the scheduler uses a process's RSS as a
>>> proxy for its LLC footprint to determine if aggregating tasks on the
>>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>>> size, aggregation is skipped. Some workloads with large RSS but small
>>> actual memory footprints may still benefit from aggregation. Since
>>> the kernel cannot efficiently track per-task cache usage (resctrl is
>>> user-space only), userspace can provide a more accurate hint.
>>> 
>>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>>> users control how strictly RSS limits aggregation. Values range from
>>> 0 to 100:
>>> 
>>> - 0: Cache-aware scheduling is disabled.
>>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>> 
>> 
>> Hi Chen Yu and Tim Chen,
>> 
>> Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).
>> 
>> I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, with 32M LLC for each 8-core CCX. I found that I need to tune "llc_aggr_tolerance" to 100, else I can't get cache-aware scheduling to work on Verilated [1] XiangShan [2] running the chacha20 [3] as I mentioned before [4].
>> 
> 
> In addition, I have investigated why this happens. And finally I
> realized that's because that workload observed 35596 kB RssAnon on
> my EPYC Milan Machine, slightly exceeding the LLC size (32M). I
> have tested it on an EPYC Genoa cloud server with the correct core
> / cache hierarchy in ACPI table, that shows 31700 kB RssAnon, thus
> fitting in LLC. I have no idea why my result shows higher RssAnon,
> since they both run Debian Trixie with the exact same kernel and
> same executable. But it reminds me we should have a userspace API
> for that.
> 
> Thanks,
> Yangyu Chen
> 
>> But if I set it to 100, I will lose some performance on stream copy benchmarks since the bandwidth is limited per CCX. Thus, I think we should have a new prctl to let userspace software hint the kernel that this task can be bound by latency between cores, and should use this feature no matter the RSS exceed the LLC size.
>> 
>> I finally have an EPYC Milan at home today. I'm glad to test this patch further. And I'm very willing to have this thus we can have a fast verilator[1] without numactl manually.
>> 
>> [1] https://github.com/verilator/verilator
>> [2] https://github.com/OpenXiangShan/Xiangshan
>> [3] https://github.com/cyyself/chacha20-xiangshan
>> [4] https://lore.kernel.org/lkml/tencent_6E51A3175F8AE0A7F684A319EE63CC56C806@qq.com/
>> 
>> Thanks,
>> Yangyu Chen
>> 
>>> For exaImple, with a 32MB L3 cache:
>>> 
>>> - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>>> - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>>>  (784GB = (1 + (99 - 1) * 256) * 32MB).
>>> 
>>> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
>>> how strictly the number of active threads is considered when doing
>>> cache aware load balance. The number of SMTs is also considered.
>>> High SMT counts reduce the aggregation capacity, preventing excessive
>>> task aggregation on SMT-heavy systems like Power10/Power11.
>>> 
>>> For example, with 8 Cores/16 CPUs in a L3:
>>> 
>>> - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>>> - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>>>  785 = (1 + (99 - 1) * 8).
>>> 
>>> (3) llc_epoch_period/llc_epoch_affinity_timeout
>>> Besides, llc_epoch_period and llc_epoch_affinity_timeout are also turned
>>> into tunable.
>>> 
>>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
>>> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>> Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
>>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> ---
>>> 
>>> Notes:
>>>  v1->v2: Remove the smt_nr check in fits_llc_capacity().
>>>          (Aaron Lu)
>>> 
>>> include/linux/sched.h   |  4 ++-
>>> kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
>>> kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
>>> kernel/sched/sched.h    |  5 ++++
>>> kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
>>> 5 files changed, 178 insertions(+), 10 deletions(-)
>>> 
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 466ba8b7398c..95bf080bbbf0 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>>> DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>>> 
>>> #ifdef CONFIG_SCHED_CACHE
>>> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
>>> +
>>> static inline bool sched_cache_enabled(void)
>>> {
>>> - return false;
>>> + return static_branch_unlikely(&sched_cache_on);
>>> }
>>> #endif
>>> 
>>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>>> index 02e16b70a790..cde324672103 100644
>>> --- a/kernel/sched/debug.c
>>> +++ b/kernel/sched/debug.c
>>> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>>> .release = single_release,
>>> };
>>> 
>>> +#ifdef CONFIG_SCHED_CACHE
>>> +#define SCHED_CACHE_CREATE_CONTROL(name, max)  \
>>> +static ssize_t sched_cache_write_##name(struct file *filp,  \
>>> + const char __user *ubuf,  \
>>> + size_t cnt, loff_t *ppos) \
>>> +{  \
>>> + char buf[16];  \
>>> + unsigned int val;  \
>>> + if (cnt > 15)  \
>>> + cnt = 15;  \
>>> + if (copy_from_user(&buf, ubuf, cnt))  \
>>> + return -EFAULT;  \
>>> + buf[cnt] = '\0';  \
>>> + if (kstrtouint(buf, 10, &val))  \
>>> + return -EINVAL;  \
>>> + if (val > (max))  \
>>> + return -EINVAL;  \
>>> + llc_##name = val;  \
>>> + if (!strcmp(#name, "enabled"))  \
>>> + sched_cache_set(false);  \
>>> + *ppos += cnt;  \
>>> + return cnt;  \
>>> +}  \
>>> +static int sched_cache_show_##name(struct seq_file *m, void *v)  \
>>> +{  \
>>> + seq_printf(m, "%d\n", llc_##name);  \
>>> + return 0;  \
>>> +}  \
>>> +static int sched_cache_open_##name(struct inode *inode,  \
>>> +   struct file *filp)  \
>>> +{  \
>>> + return single_open(filp, sched_cache_show_##name, NULL);  \
>>> +}  \
>>> +static const struct file_operations sched_cache_fops_##name = {  \
>>> + .open = sched_cache_open_##name,  \
>>> + .write = sched_cache_write_##name,  \
>>> + .read = seq_read,  \
>>> + .llseek = seq_lseek,  \
>>> + .release = single_release,  \
>>> +}
>>> +
>>> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
>>> +#endif /* SCHED_CACHE */
>>> +
>>> static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
>>> size_t cnt, loff_t *ppos)
>>> {
>>> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
>>> debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
>>> #endif /* CONFIG_NUMA_BALANCING */
>>> 
>>> +#ifdef CONFIG_SCHED_CACHE
>>> + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_overload_pct);
>>> + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_imb_pct);
>>> + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_aggr_tolerance);
>>> + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_enabled);
>>> + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
>>> +   &llc_epoch_period);
>>> + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
>>> +   &llc_epoch_affinity_timeout);
>>> +#endif
>>> +
>>> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>>> 
>>> debugfs_fair_server_init();
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 424ec601cfdf..a2e2d6742481 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>>> 
>>> __read_mostly unsigned int llc_overload_pct       = 50;
>>> __read_mostly unsigned int llc_imb_pct            = 20;
>>> +__read_mostly unsigned int llc_aggr_tolerance     = 1;
>>> +__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
>>> +__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
>>> 
>>> static int llc_id(int cpu)
>>> {
>>> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
>>> return llc;
>>> }
>>> 
>>> +static inline int get_sched_cache_scale(int mul)
>>> +{
>>> + if (!llc_aggr_tolerance)
>>> + return 0;
>>> +
>>> + if (llc_aggr_tolerance == 100)
>>> + return INT_MAX;
>>> +
>>> + return (1 + (llc_aggr_tolerance - 1) * mul);
>>> +}
>>> +
>>> static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> {
>>> + unsigned int llc, scale;
>>> struct cacheinfo *ci;
>>> unsigned long rss;
>>> - unsigned int llc;
>>> 
>>> /*
>>> * get_cpu_cacheinfo_level() can not be used
>>> @@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> rss = get_mm_counter(mm, MM_ANONPAGES) +
>>> get_mm_counter(mm, MM_SHMEMPAGES);
>>> 
>>> - return (llc <= (rss * PAGE_SIZE));
>>> + /*
>>> + * Scale the LLC size by 256*llc_aggr_tolerance
>>> + * and compare it to the task's RSS size.
>>> + *
>>> + * Suppose the L3 size is 32MB. If the
>>> + * llc_aggr_tolerance is 1:
>>> + * When the RSS is larger than 32MB, the process
>>> + * is regarded as exceeding the LLC capacity. If
>>> + * the llc_aggr_tolerance is 99:
>>> + * When the RSS is larger than 784GB, the process
>>> + * is regarded as exceeding the LLC capacity because:
>>> + * 784GB = (1 + (99 - 1) * 256) * 32MB
>>> + */
>>> + scale = get_sched_cache_scale(256);
>>> + if (scale == INT_MAX)
>>> + return false;
>>> +
>>> + return ((llc * scale) <= (rss * PAGE_SIZE));
>>> }
>>> 
>>> static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>>> {
>>> - int smt_nr = 1;
>>> + int smt_nr = 1, scale;
>>> 
>>> #ifdef CONFIG_SCHED_SMT
>>> if (sched_smt_active())
>>> smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>>> #endif
>>> + /*
>>> + * Scale the Core number in a LLC by llc_aggr_tolerance
>>> + * and compare it to the task's active threads.
>>> + *
>>> + * Suppose the number of Cores in LLC is 8.
>>> + * Every core has 2 SMTs.
>>> + * If the llc_aggr_tolerance is 1: When the
>>> + * nr_running is larger than 8, the process
>>> + * is regarded as exceeding the LLC capacity.
>>> + * If the llc_aggr_tolerance is 99:
>>> + * When the nr_running is larger than 785,
>>> + * the process is regarded as exceeding
>>> + * the LLC capacity:
>>> + * 785 = 1 + (99 - 1) * 8
>>> + */
>>> + scale = get_sched_cache_scale(1);
>>> + if (scale == INT_MAX)
>>> + return false;
>>> 
>>> - return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
>>> + return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
>>> }
>>> 
>>> static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
>>> @@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
>>> long delta = now - rq->cpu_epoch_next;
>>> 
>>> if (delta > 0) {
>>> - n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
>>> + n = (delta + llc_epoch_period - 1) / llc_epoch_period;
>>> rq->cpu_epoch += n;
>>> - rq->cpu_epoch_next += n * EPOCH_PERIOD;
>>> + rq->cpu_epoch_next += n * llc_epoch_period;
>>> __shr_u64(&rq->cpu_runtime, n);
>>> }
>>> 
>>> @@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>>> * has only 1 thread, or has too many active threads, invalidate
>>> * its preferred state.
>>> */
>>> - if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
>>> + if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
>>>  get_nr_threads(p) <= 1 ||
>>>  exceed_llc_nr(mm, cpu_of(rq)) ||
>>>  exceed_llc_capacity(mm, cpu_of(rq))) {
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 40798a06e058..15d126bd3728 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
>>> #ifdef CONFIG_SCHED_CACHE
>>> extern unsigned int llc_overload_pct;
>>> extern unsigned int llc_imb_pct;
>>> +extern unsigned int llc_aggr_tolerance;
>>> +extern unsigned int llc_epoch_period;
>>> +extern unsigned int llc_epoch_affinity_timeout;
>>> +extern unsigned int llc_enabled;
>>> +void sched_cache_set(bool locked);
>>> #endif
>>> 
>>> #ifdef CONFIG_SCHED_HRTICK
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 9799e3a9a609..818599ddaaef 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -26,6 +26,49 @@ int max_llcs;
>>> 
>>> static bool sched_cache_present;
>>> 
>>> +unsigned int llc_enabled = 1;
>>> +DEFINE_STATIC_KEY_FALSE(sched_cache_on);
>>> +
>>> +/*
>>> + * Enable/disable cache aware scheduling according to
>>> + * user input and the presence of hardware support.
>>> + */
>>> +static void _sched_cache_set(bool enable, bool locked)
>>> +{
>>> + if (enable) {
>>> + if (locked)
>>> + static_branch_enable_cpuslocked(&sched_cache_on);
>>> + else
>>> + static_branch_enable(&sched_cache_on);
>>> + } else {
>>> + if (locked)
>>> + static_branch_disable_cpuslocked(&sched_cache_on);
>>> + else
>>> + static_branch_disable(&sched_cache_on);
>>> + }
>>> +}
>>> +
>>> +void sched_cache_set(bool locked)
>>> +{
>>> + /* hardware does not support */
>>> + if (!sched_cache_present) {
>>> + if (static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(false, locked);
>>> +
>>> + return;
>>> + }
>>> +
>>> + /* user wants it or not ?*/
>>> + if (llc_enabled) {
>>> + if (!static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(true, locked);
>>> +
>>> + } else {
>>> + if (static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(false, locked);
>>> + }
>>> +}
>>> +
>>> static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
>>> {
>>> unsigned int *new = NULL;
>>> @@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> * new buffer.
>>> */
>>> tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
>>> - if (!tmp_llc_pref)
>>> - return -ENOMEM;
>>> + if (!tmp_llc_pref) {
>>> + sched_cache_present = false;
>>> + ret = -ENOMEM;
>>> +
>>> + goto out;
>>> + }
>>> 
>>> for_each_present_cpu(i)
>>> *per_cpu_ptr(tmp_llc_pref, i) = NULL;
>>> @@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
>>> if (!new) {
>>> ret = -ENOMEM;
>>> + sched_cache_present = false;
>>> 
>>> goto release_old;
>>> }
>>> @@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> if (!ret)
>>> max_llcs = new_max_llcs;
>>> 
>>> +out:
>>> + sched_cache_set(true);
>>> return ret;
>>> }
>>> 
>>> -- 
>>> 2.32.0
>> 
>>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>> 
>>> From: Chen Yu <yu.c.chen@intel.com>
>>> 
>>> Introduce a set of debugfs knobs to control the enabling of
>>> and parameters for cache-aware load balancing.
>>> 
>>> (1) llc_enabled
>>> llc_enabled acts as the primary switch - users can toggle it to
>>> enable or disable cache aware load balancing.
>>> 
>>> (2) llc_aggr_tolerance
>>> With sched_cache enabled, the scheduler uses a process's RSS as a
>>> proxy for its LLC footprint to determine if aggregating tasks on the
>>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>>> size, aggregation is skipped. Some workloads with large RSS but small
>>> actual memory footprints may still benefit from aggregation. Since
>>> the kernel cannot efficiently track per-task cache usage (resctrl is
>>> user-space only), userspace can provide a more accurate hint.
>>> 
>>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>>> users control how strictly RSS limits aggregation. Values range from
>>> 0 to 100:
>>> 
>>> - 0: Cache-aware scheduling is disabled.
>>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>> 
>> 
>> Hi Chen Yu and Tim Chen,
>> 
>> Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).
>> 
>> I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, with 32M LLC for each 8-core CCX. I found that I need to tune "llc_aggr_tolerance" to 100, else I can't get cache-aware scheduling to work on Verilated [1] XiangShan [2] running the chacha20 [3] as I mentioned before [4].
>> 
>> But if I set it to 100, I will lose some performance on stream copy benchmarks since the bandwidth is limited per CCX. Thus, I think we should have a new prctl to let userspace software hint the kernel that this task can be bound by latency between cores, and should use this feature no matter the RSS exceed the LLC size.
>> 
>> I finally have an EPYC Milan at home today. I'm glad to test this patch further. And I'm very willing to have this thus we can have a fast verilator[1] without numactl manually.
>> 
>> [1] https://github.com/verilator/verilator
>> [2] https://github.com/OpenXiangShan/Xiangshan
>> [3] https://github.com/cyyself/chacha20-xiangshan
>> [4] https://lore.kernel.org/lkml/tencent_6E51A3175F8AE0A7F684A319EE63CC56C806@qq.com/
>> 
>> Thanks,
>> Yangyu Chen
>> 
>>> For exaImple, with a 32MB L3 cache:
>>> 
>>> - llc_aggr_tolerance=1 -> tasks with RSS > 32MB are skipped.
>>> - llc_aggr_tolerance=99 -> tasks with RSS > 784GB are skipped
>>>  (784GB = (1 + (99 - 1) * 256) * 32MB).
>>> 
>>> Similarly, /sys/kernel/debug/sched/llc_aggr_tolerance also controls
>>> how strictly the number of active threads is considered when doing
>>> cache aware load balance. The number of SMTs is also considered.
>>> High SMT counts reduce the aggregation capacity, preventing excessive
>>> task aggregation on SMT-heavy systems like Power10/Power11.
>>> 
>>> For example, with 8 Cores/16 CPUs in a L3:
>>> 
>>> - llc_aggr_tolerance=1 -> tasks with nr_running > 8 are skipped.
>>> - llc_aggr_tolerance=99 -> tasks with nr_running > 785 are skipped
>>>  785 = (1 + (99 - 1) * 8).
>>> 
>>> (3) llc_epoch_period/llc_epoch_affinity_timeout
>>> Besides, llc_epoch_period and llc_epoch_affinity_timeout are also turned
>>> into tunable.
>>> 
>>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
>>> Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>> Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
>>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> ---
>>> 
>>> Notes:
>>>  v1->v2: Remove the smt_nr check in fits_llc_capacity().
>>>          (Aaron Lu)
>>> 
>>> include/linux/sched.h   |  4 ++-
>>> kernel/sched/debug.c    | 62 ++++++++++++++++++++++++++++++++++++++++
>>> kernel/sched/fair.c     | 63 ++++++++++++++++++++++++++++++++++++-----
>>> kernel/sched/sched.h    |  5 ++++
>>> kernel/sched/topology.c | 54 +++++++++++++++++++++++++++++++++--
>>> 5 files changed, 178 insertions(+), 10 deletions(-)
>>> 
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 466ba8b7398c..95bf080bbbf0 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -2436,9 +2436,11 @@ extern void migrate_enable(void);
>>> DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())
>>> 
>>> #ifdef CONFIG_SCHED_CACHE
>>> +DECLARE_STATIC_KEY_FALSE(sched_cache_on);
>>> +
>>> static inline bool sched_cache_enabled(void)
>>> {
>>> - return false;
>>> + return static_branch_unlikely(&sched_cache_on);
>>> }
>>> #endif
>>> 
>>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>>> index 02e16b70a790..cde324672103 100644
>>> --- a/kernel/sched/debug.c
>>> +++ b/kernel/sched/debug.c
>>> @@ -169,6 +169,53 @@ static const struct file_operations sched_feat_fops = {
>>> .release = single_release,
>>> };
>>> 
>>> +#ifdef CONFIG_SCHED_CACHE
>>> +#define SCHED_CACHE_CREATE_CONTROL(name, max)  \
>>> +static ssize_t sched_cache_write_##name(struct file *filp,  \
>>> + const char __user *ubuf,  \
>>> + size_t cnt, loff_t *ppos) \
>>> +{  \
>>> + char buf[16];  \
>>> + unsigned int val;  \
>>> + if (cnt > 15)  \
>>> + cnt = 15;  \
>>> + if (copy_from_user(&buf, ubuf, cnt))  \
>>> + return -EFAULT;  \
>>> + buf[cnt] = '\0';  \
>>> + if (kstrtouint(buf, 10, &val))  \
>>> + return -EINVAL;  \
>>> + if (val > (max))  \
>>> + return -EINVAL;  \
>>> + llc_##name = val;  \
>>> + if (!strcmp(#name, "enabled"))  \
>>> + sched_cache_set(false);  \
>>> + *ppos += cnt;  \
>>> + return cnt;  \
>>> +}  \
>>> +static int sched_cache_show_##name(struct seq_file *m, void *v)  \
>>> +{  \
>>> + seq_printf(m, "%d\n", llc_##name);  \
>>> + return 0;  \
>>> +}  \
>>> +static int sched_cache_open_##name(struct inode *inode,  \
>>> +   struct file *filp)  \
>>> +{  \
>>> + return single_open(filp, sched_cache_show_##name, NULL);  \
>>> +}  \
>>> +static const struct file_operations sched_cache_fops_##name = {  \
>>> + .open = sched_cache_open_##name,  \
>>> + .write = sched_cache_write_##name,  \
>>> + .read = seq_read,  \
>>> + .llseek = seq_lseek,  \
>>> + .release = single_release,  \
>>> +}
>>> +
>>> +SCHED_CACHE_CREATE_CONTROL(overload_pct, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(imb_pct, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(aggr_tolerance, 100);
>>> +SCHED_CACHE_CREATE_CONTROL(enabled, 1);
>>> +#endif /* SCHED_CACHE */
>>> +
>>> static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
>>> size_t cnt, loff_t *ppos)
>>> {
>>> @@ -523,6 +570,21 @@ static __init int sched_init_debug(void)
>>> debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
>>> #endif /* CONFIG_NUMA_BALANCING */
>>> 
>>> +#ifdef CONFIG_SCHED_CACHE
>>> + debugfs_create_file("llc_overload_pct", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_overload_pct);
>>> + debugfs_create_file("llc_imb_pct", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_imb_pct);
>>> + debugfs_create_file("llc_aggr_tolerance", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_aggr_tolerance);
>>> + debugfs_create_file("llc_enabled", 0644, debugfs_sched, NULL,
>>> +    &sched_cache_fops_enabled);
>>> + debugfs_create_u32("llc_epoch_period", 0644, debugfs_sched,
>>> +   &llc_epoch_period);
>>> + debugfs_create_u32("llc_epoch_affinity_timeout", 0644, debugfs_sched,
>>> +   &llc_epoch_affinity_timeout);
>>> +#endif
>>> +
>>> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>>> 
>>> debugfs_fair_server_init();
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 424ec601cfdf..a2e2d6742481 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1207,6 +1207,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
>>> 
>>> __read_mostly unsigned int llc_overload_pct       = 50;
>>> __read_mostly unsigned int llc_imb_pct            = 20;
>>> +__read_mostly unsigned int llc_aggr_tolerance     = 1;
>>> +__read_mostly unsigned int llc_epoch_period       = EPOCH_PERIOD;
>>> +__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
>>> 
>>> static int llc_id(int cpu)
>>> {
>>> @@ -1223,11 +1226,22 @@ static int llc_id(int cpu)
>>> return llc;
>>> }
>>> 
>>> +static inline int get_sched_cache_scale(int mul)
>>> +{
>>> + if (!llc_aggr_tolerance)
>>> + return 0;
>>> +
>>> + if (llc_aggr_tolerance == 100)
>>> + return INT_MAX;
>>> +
>>> + return (1 + (llc_aggr_tolerance - 1) * mul);
>>> +}
>>> +
>>> static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> {
>>> + unsigned int llc, scale;
>>> struct cacheinfo *ci;
>>> unsigned long rss;
>>> - unsigned int llc;
>>> 
>>> /*
>>> * get_cpu_cacheinfo_level() can not be used
>>> @@ -1252,19 +1266,54 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> rss = get_mm_counter(mm, MM_ANONPAGES) +
>>> get_mm_counter(mm, MM_SHMEMPAGES);
>>> 
>>> - return (llc <= (rss * PAGE_SIZE));
>>> + /*
>>> + * Scale the LLC size by 256*llc_aggr_tolerance
>>> + * and compare it to the task's RSS size.
>>> + *
>>> + * Suppose the L3 size is 32MB. If the
>>> + * llc_aggr_tolerance is 1:
>>> + * When the RSS is larger than 32MB, the process
>>> + * is regarded as exceeding the LLC capacity. If
>>> + * the llc_aggr_tolerance is 99:
>>> + * When the RSS is larger than 784GB, the process
>>> + * is regarded as exceeding the LLC capacity because:
>>> + * 784GB = (1 + (99 - 1) * 256) * 32MB
>>> + */
>>> + scale = get_sched_cache_scale(256);
>>> + if (scale == INT_MAX)
>>> + return false;
>>> +
>>> + return ((llc * scale) <= (rss * PAGE_SIZE));
>>> }
>>> 
>>> static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>>> {
>>> - int smt_nr = 1;
>>> + int smt_nr = 1, scale;
>>> 
>>> #ifdef CONFIG_SCHED_SMT
>>> if (sched_smt_active())
>>> smt_nr = cpumask_weight(cpu_smt_mask(cpu));
>>> #endif
>>> + /*
>>> + * Scale the Core number in a LLC by llc_aggr_tolerance
>>> + * and compare it to the task's active threads.
>>> + *
>>> + * Suppose the number of Cores in LLC is 8.
>>> + * Every core has 2 SMTs.
>>> + * If the llc_aggr_tolerance is 1: When the
>>> + * nr_running is larger than 8, the process
>>> + * is regarded as exceeding the LLC capacity.
>>> + * If the llc_aggr_tolerance is 99:
>>> + * When the nr_running is larger than 785,
>>> + * the process is regarded as exceeding
>>> + * the LLC capacity:
>>> + * 785 = 1 + (99 - 1) * 8
>>> + */
>>> + scale = get_sched_cache_scale(1);
>>> + if (scale == INT_MAX)
>>> + return false;
>>> 
>>> - return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
>>> + return ((mm->nr_running_avg * smt_nr) > (scale * per_cpu(sd_llc_size, cpu)));
>>> }
>>> 
>>> static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
>>> @@ -1350,9 +1399,9 @@ static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
>>> long delta = now - rq->cpu_epoch_next;
>>> 
>>> if (delta > 0) {
>>> - n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
>>> + n = (delta + llc_epoch_period - 1) / llc_epoch_period;
>>> rq->cpu_epoch += n;
>>> - rq->cpu_epoch_next += n * EPOCH_PERIOD;
>>> + rq->cpu_epoch_next += n * llc_epoch_period;
>>> __shr_u64(&rq->cpu_runtime, n);
>>> }
>>> 
>>> @@ -1412,7 +1461,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>>> * has only 1 thread, or has too many active threads, invalidate
>>> * its preferred state.
>>> */
>>> - if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
>>> + if (epoch - READ_ONCE(mm->mm_sched_epoch) > llc_epoch_affinity_timeout ||
>>>  get_nr_threads(p) <= 1 ||
>>>  exceed_llc_nr(mm, cpu_of(rq)) ||
>>>  exceed_llc_capacity(mm, cpu_of(rq))) {
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 40798a06e058..15d126bd3728 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -2852,6 +2852,11 @@ extern unsigned int sysctl_numa_balancing_hot_threshold;
>>> #ifdef CONFIG_SCHED_CACHE
>>> extern unsigned int llc_overload_pct;
>>> extern unsigned int llc_imb_pct;
>>> +extern unsigned int llc_aggr_tolerance;
>>> +extern unsigned int llc_epoch_period;
>>> +extern unsigned int llc_epoch_affinity_timeout;
>>> +extern unsigned int llc_enabled;
>>> +void sched_cache_set(bool locked);
>>> #endif
>>> 
>>> #ifdef CONFIG_SCHED_HRTICK
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index 9799e3a9a609..818599ddaaef 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -26,6 +26,49 @@ int max_llcs;
>>> 
>>> static bool sched_cache_present;
>>> 
>>> +unsigned int llc_enabled = 1;
>>> +DEFINE_STATIC_KEY_FALSE(sched_cache_on);
>>> +
>>> +/*
>>> + * Enable/disable cache aware scheduling according to
>>> + * user input and the presence of hardware support.
>>> + */
>>> +static void _sched_cache_set(bool enable, bool locked)
>>> +{
>>> + if (enable) {
>>> + if (locked)
>>> + static_branch_enable_cpuslocked(&sched_cache_on);
>>> + else
>>> + static_branch_enable(&sched_cache_on);
>>> + } else {
>>> + if (locked)
>>> + static_branch_disable_cpuslocked(&sched_cache_on);
>>> + else
>>> + static_branch_disable(&sched_cache_on);
>>> + }
>>> +}
>>> +
>>> +void sched_cache_set(bool locked)
>>> +{
>>> + /* hardware does not support */
>>> + if (!sched_cache_present) {
>>> + if (static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(false, locked);
>>> +
>>> + return;
>>> + }
>>> +
>>> + /* Does the user want it enabled or not? */
>>> + if (llc_enabled) {
>>> + if (!static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(true, locked);
>>> +
>>> + } else {
>>> + if (static_branch_likely(&sched_cache_on))
>>> + _sched_cache_set(false, locked);
>>> + }
>>> +}
>>> +
>>> static unsigned int *alloc_new_pref_llcs(unsigned int *old, unsigned int **gc)
>>> {
>>> unsigned int *new = NULL;
>>> @@ -70,8 +113,12 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> * new buffer.
>>> */
>>> tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
>>> - if (!tmp_llc_pref)
>>> - return -ENOMEM;
>>> + if (!tmp_llc_pref) {
>>> + sched_cache_present = false;
>>> + ret = -ENOMEM;
>>> +
>>> + goto out;
>>> + }
>>> 
>>> for_each_present_cpu(i)
>>> *per_cpu_ptr(tmp_llc_pref, i) = NULL;
>>> @@ -89,6 +136,7 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
>>> if (!new) {
>>> ret = -ENOMEM;
>>> + sched_cache_present = false;
>>> 
>>> goto release_old;
>>> }
>>> @@ -126,6 +174,8 @@ static int resize_llc_pref(bool has_multi_llcs)
>>> if (!ret)
>>> max_llcs = new_max_llcs;
>>> 
>>> +out:
>>> + sched_cache_set(true);
>>> return ret;
>>> }
>>> 
>>> -- 
>>> 2.32.0
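
For a concrete feel of the llc_aggr_tolerance scaling quoted above, here
is a small stand-alone user-space sketch (illustration only; the 32MB
LLC and 8-core/2-SMT figures are just the example values used in the
patch comments):

	#include <limits.h>
	#include <stdio.h>

	/* Mirrors get_sched_cache_scale() from the quoted patch. */
	static long long cache_scale(unsigned int tolerance, int mul)
	{
		if (!tolerance)
			return 0;		/* cache-aware scheduling disabled */
		if (tolerance == 100)
			return LLONG_MAX;	/* never treated as exceeding */
		return 1LL + (tolerance - 1LL) * mul;
	}

	int main(void)
	{
		const long long llc_bytes = 32LL << 20;	/* example: 32MB LLC */
		const int llc_cpus = 16, smt_nr = 2;	/* example: 8 cores x 2 SMT */
		unsigned int tol;

		for (tol = 1; tol <= 100; tol += 33) {
			long long rss_scale = cache_scale(tol, 256);
			long long nr_scale  = cache_scale(tol, 1);

			printf("tolerance %3u: ", tol);

			if (rss_scale == LLONG_MAX)
				printf("RSS threshold unlimited, ");
			else
				printf("RSS threshold %lld MB, ",
				       (llc_bytes * rss_scale) >> 20);

			if (nr_scale == LLONG_MAX)
				printf("thread threshold unlimited\n");
			else
				printf("thread threshold %lld\n",
				       nr_scale * llc_cpus / smt_nr);
		}
		return 0;
	}

With llc_aggr_tolerance=1 this prints a 32 MB RSS threshold and an
8-thread threshold, matching the comments in exceed_llc_capacity()
and exceed_llc_nr() above.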



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-23  5:31   ` K Prateek Nayak
@ 2025-12-24  7:08     ` Chen, Yu C
  2025-12-24  8:19       ` K Prateek Nayak
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-24  7:08 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Ingo Molnar, Adam Li, Aaron Lu, Tim Chen, linux-kernel,
	Vincent Guittot, Peter Zijlstra, Gautham R . Shenoy, Tim Chen

Hello Prateek,

On 12/23/2025 1:31 PM, K Prateek Nayak wrote:
> Hello Tim, Chenyu,
> 
> On 12/4/2025 4:37 AM, Tim Chen wrote:
>> +/*
>> + * Assign continuous llc id for the CPU, and return
>> + * the assigned llc id.
>> + */
>> +static int update_llc_id(struct sched_domain *sd,
>> +			 int cpu)
>> +{
>> +	int id = per_cpu(sd_llc_id, cpu), i;
>> +
>> +	if (id >= 0)
>> +		return id;
>> +
>> +	if (sd) {
>> +		/* Look for any assigned id and reuse it.*/
>> +		for_each_cpu(i, sched_domain_span(sd)) {
>> +			id = per_cpu(sd_llc_id, i);
>> +
>> +			if (id >= 0) {
>> +				per_cpu(sd_llc_id, cpu) = id;
>> +				return id;
>> +			}
>> +		}
>> +	}
> 
> I don't really like tying this down to the sched_domain span since
> partition and other weirdness can cause the max_llc count to go
> unnecessarily high. The tl->mask() (from sched_domain_topology_level)
> should give the mask considering all online CPUs and not bothering
> about cpusets.

OK, using the topology_level's mask (tl's mask) should allow us to
skip the cpuset partitions. I just wanted to confirm: is your concern
about the excessive number of sd_llc_ids caused by cpusets?

I was under the impression that without this patch, llc_ids are
unique across different partitions.

For example, on a vanilla kernel without cache_aware,
suppose one LLC has CPU0,1,2,3. Before partitioning, all
CPUs have the same llc_id 0. Then create a new partition:
mkdir -p /sys/fs/cgroup/cgroup0
echo "3" > /sys/fs/cgroup/cgroup0/cpuset.cpus
echo root > /sys/fs/cgroup/cgroup0/cpuset.cpus.partition
Now CPU0,1,2 share llc_id 0, and CPU3 has a dedicated llc_id 3.
Do you suggest letting CPU3 reuse llc_id 0, so as to save
llc_id space?

> 
> How about something like:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b17d8e3cb55..c19b1c4e6472 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8270,6 +8270,18 @@ static void cpuset_cpu_active(void)
>   static void cpuset_cpu_inactive(unsigned int cpu)
>   {
>   	if (!cpuhp_tasks_frozen) {
> +		/*
> +		 * This is necessary since offline CPUs are
> +		 * taken out of the tl->mask() and a newly
> +		 * onlined CPU in same LLC will not realize
> +		 * whether it should reuse the LLC ID owned
> +		 * by an offline CPU without knowing the
> +		 * LLC association.
> +		 *
> +		 * Safe to release the reference if this is
> +		 * the last CPU in the LLC going offline.
> +		 */
> +		sched_domain_free_llc_id(cpu);

I'm OK with replacing the domain-based cpumask with the topology_level
mask; I'm just wondering whether re-using the llc_id would increase
the risk of race conditions - it is possible that a CPU has different
llc_ids before and after going offline/online. Can we assign/reserve
a "static" llc_id for each CPU, whether it is online or offline? In
this way, we don't need to worry about data synchronization when using
llc_id(). For example, I can think of adjusting the data in the percpu
nr_pref_llc[max_llcs] on every CPU whenever a CPU goes
offline/online.

>   		cpuset_update_active_cpus();
>   	} else {
>   		num_cpus_frozen++;
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 41caa22e0680..1378a1cfad18 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -631,6 +631,7 @@ void update_sched_domain_debugfs(void)
>   			i++;
>   		}
>   
> +		debugfs_create_u32("llc_id", 0444, d_cpu, (u32 *)per_cpu_ptr(&sd_llc_id, cpu));
>   		__cpumask_clear_cpu(cpu, sd_sysctl_cpus);
>   	}
>   }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3ceaa9dc9a9e..69fad88b57d8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2142,6 +2142,7 @@ extern int group_balance_cpu(struct sched_group *sg);
>   
>   extern void update_sched_domain_debugfs(void);
>   extern void dirty_sched_domain_sysctl(int cpu);
> +void sched_domain_free_llc_id(int cpu);
>   
>   extern int sched_update_scaling(void);
>   
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index cf643a5ddedd..d6e134767f30 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -20,6 +20,46 @@ void sched_domains_mutex_unlock(void)
>   /* Protected by sched_domains_mutex: */
>   static cpumask_var_t sched_domains_tmpmask;
>   static cpumask_var_t sched_domains_tmpmask2;
> +static cpumask_var_t sched_llc_id_alloc_mask;
> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
> +static int max_llcs = 0;
> +
> +static inline int sched_domain_alloc_llc_id(void)
> +{
> +	int llc_id;
> +
> +	lockdep_assert_held(&sched_domains_mutex);
> +
> +	llc_id = cpumask_first_zero(sched_llc_id_alloc_mask);
> +	BUG_ON((unsigned int)llc_id >= nr_cpumask_bits);
> +	cpumask_set_cpu(llc_id, sched_llc_id_alloc_mask);
> +	++max_llcs;
> +
> +	return llc_id;
> +}
> +
> +void sched_domain_free_llc_id(int cpu)
> +{
> +	int i, llc_id = per_cpu(sd_llc_id, cpu);
> +	bool found = false;
> +
> +	lockdep_assert_cpus_held(); /* For cpu_active_mask. */
> +	guard(mutex)(&sched_domains_mutex);
> +
> +	per_cpu(sd_llc_id, cpu) = -1;
> +	for_each_cpu(i, cpu_active_mask) {
> +		if (per_cpu(sd_llc_id, i) == llc_id) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	/* Allow future hotplugs to claim this ID */
> +	if (!found) {
> +		cpumask_clear_cpu(llc_id, sched_llc_id_alloc_mask);
> +		--max_llcs;

Maybe only allow increasing the value of max_llcs when a new LLC
is detected. That is, max_llcs represents the total number of LLCs
that have ever been detected, even if some of the corresponding
CPUs have been taken offline via runtime hotplug. In this way, the
data synchronization might be simpler - trading some additional
memory for code simplicity?

> +	}
> +}
>   
>   static int __init sched_debug_setup(char *str)
>   {
> @@ -658,7 +698,6 @@ static void destroy_sched_domains(struct sched_domain *sd)
>    */
>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>   DEFINE_PER_CPU(int, sd_llc_size);
> -DEFINE_PER_CPU(int, sd_llc_id);
>   DEFINE_PER_CPU(int, sd_share_id);
>   DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> @@ -684,7 +723,6 @@ static void update_top_cache_domain(int cpu)
>   
>   	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>   	per_cpu(sd_llc_size, cpu) = size;
> -	per_cpu(sd_llc_id, cpu) = id;
>   	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>   
>   	sd = lowest_flag_domain(cpu, SD_CLUSTER);
> @@ -2567,10 +2605,35 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>   
>   	/* Set up domains for CPUs specified by the cpu_map: */
>   	for_each_cpu(i, cpu_map) {
> -		struct sched_domain_topology_level *tl;
> +		struct sched_domain_topology_level *tl, *tl_llc = NULL;
> +		bool done = false;
>   
>   		sd = NULL;
>   		for_each_sd_topology(tl) {
> +			int flags = 0;
> +
> +			if (tl->sd_flags)
> +				flags = (*tl->sd_flags)();
> +
> +			if (flags & SD_SHARE_LLC) {
> +				tl_llc = tl;
> +
> +				/*
> +				 * Entire cpu_map has been covered. We are
> +				 * traversing only to find the highest
> +				 * SD_SHARE_LLC level.
> +				 */
> +				if (done)
> +					continue;
> +			}
> +
> +			/*
> +			 * Since SD_SHARE_LLC is SDF_SHARED_CHILD, we can
> +			 * safely break out if the entire cpu_map has been
> +			 * covered by a child domain.
> +			 */
> +			if (done)
> +				break;
>   
>   			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
>   
> @@ -2579,7 +2642,41 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>   			if (tl == sched_domain_topology)
>   				*per_cpu_ptr(d.sd, i) = sd;
>   			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
> -				break;
> +				done = true;
> +		}
> +
> +		/* First time visiting this CPU. Assign the llc_id. */
> +		if (per_cpu(sd_llc_id, i) == -1) {
> +			int j, llc_id = -1;
> +
> +			/*
> +			 * In case there are no SD_SHARE_LLC domains,
> +			 * each CPU gets its own llc_id. Find the first
> +			 * free bit on the mask and use it.
> +			 */
> +			if (!tl_llc) {
> +				per_cpu(sd_llc_id, i) = sched_domain_alloc_llc_id();
> +				continue;
> +			}
> +
> +			/*
> +			 * Visit all the CPUs of the LLC irrespective of the
> +			 * partition constraints and find if any of them have
> +			 * a valid llc_id.
> +			 */
> +			for_each_cpu(j, tl_llc->mask(tl, i)) {

This is doable, we can use tl rather than domain's mask to
share llc_id among partitions.

> +				llc_id = per_cpu(sd_llc_id, j);
> +
> +				/* Found a valid llc_id for CPU's LLC. */
> +				if (llc_id != -1)
> +					break;
> +			}
> +
> +			/* Valid llc_id not found. Allocate a new one. */
> +			if (llc_id == -1)
> +				llc_id = sched_domain_alloc_llc_id();
> +
> +			per_cpu(sd_llc_id, i) = llc_id;
>   		}
>   	}
>   
> @@ -2759,6 +2856,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
>   
>   	zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
>   	zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
> +	zalloc_cpumask_var(&sched_llc_id_alloc_mask, GFP_KERNEL);
>   	zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
>   
>   	arch_update_cpu_topology();
> ---
> 
> AFAICT, "sd_llc_id" isn't compared across different partitions so having
> the CPUs that are actually associated with same physical LLC but across
> different partitions sharing the same "sd_llc_id" shouldn't be a problem.
> 
> Thoughts?
>

This means cpus_share_resources(int this_cpu, int that_cpu)
should only be invoked when this_cpu and that_cpu belong to the
same partition. In this way, we do not alter the existing semantics
of cpus_share_resources(). We can audit the places where
cpus_share_resources() is used.

Happy holidays,
Chenyu


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-24  3:28       ` Yangyu Chen
@ 2025-12-24  7:51         ` Chen, Yu C
  2025-12-24 12:15           ` Yangyu Chen
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-24  7:51 UTC (permalink / raw)
  To: Yangyu Chen, Tim Chen
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Tingyin Duan,
	Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Qais Yousef

On 12/24/2025 11:28 AM, Yangyu Chen wrote:
> 
> 
>> On 24 Dec 2025, at 00:44, Yangyu Chen <cyy@cyyself.name> wrote:
>>
>>> On 23 Dec 2025, at 20:12, Yangyu Chen <cyy@cyyself.name> wrote:
>>>
>>>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>>>
>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>
>>>> Introduce a set of debugfs knobs to control the enabling of
>>>> and parameters for cache-aware load balancing.
>>>>
>>>> (1) llc_enabled
>>>> llc_enabled acts as the primary switch - users can toggle it to
>>>> enable or disable cache aware load balancing.
>>>>
>>>> (2) llc_aggr_tolerance
>>>> With sched_cache enabled, the scheduler uses a process's RSS as a
>>>> proxy for its LLC footprint to determine if aggregating tasks on the
>>>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>>>> size, aggregation is skipped. Some workloads with large RSS but small
>>>> actual memory footprints may still benefit from aggregation. Since
>>>> the kernel cannot efficiently track per-task cache usage (resctrl is
>>>> user-space only), userspace can provide a more accurate hint.
>>>>
>>>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>>>> users control how strictly RSS limits aggregation. Values range from
>>>> 0 to 100:
>>>>
>>>> - 0: Cache-aware scheduling is disabled.
>>>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>>>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>>>
>>>
>>> Hi Chen Yu and Tim Chen,
>>>
>>> Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).
>>>
>>> I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, with 32M LLC for each 8-core CCX. I found that I need to tune "llc_aggr_tolerance" to 100, else I can't get cache-aware scheduling to work on Verilated [1] XiangShan [2] running the chacha20 [3] as I mentioned before [4].
>>>
>>
>> In addition, I have investigated why this happens. And finally I
>> realized that's because that workload observed 35596 kB RssAnon on
>> my EPYC Milan Machine, slightly exceeding the LLC size (32M). I
>> have tested it on an EPYC Genoa cloud server with the correct core
>> / cache hierarchy in ACPI table, that shows 31700 kB RssAnon, thus
>> fitting in LLC. I have no idea why my result shows higher RssAnon,
>> since they both run Debian Trixie with the exact same kernel and
>> same executable. But it reminds me we should have a userspace API
>> for that.
>>
> 
> In addition, during profiling the verilator, I found that if scheduled
> to SMTs, it will result in poor performance. Thus, I think we should
> separate the control for rss size with the SMT scale.
> 

Thanks for the investigation. Could you elaborate a little more on
what "scheduled to SMTs" means here? Do you mean that if every CPU
(SMT) in the LLC has 1 running task, then performance is impacted?
I thought we had exceed_llc_nr() checking the SMT count to avoid this?

> It's notable that rss size is not the actual memory footprint. It
> would be better if we could measure the l2_miss event or l3_miss
> event to measure the l3 hit rate. Just for future work.
> 

Yes, in user space we can collect PMU events/memory bandwidth via
resctrl to decide whether to set task attributes.

> I'm willing to provide a patch for such a prctl. But I'm busy these
> days, maybe I can have the time to do that after one week.
> 

Sure. We haven't yet decided which interface we can leverage.
Also, Qais is working on a QoS interface[1] - maybe we can build
on his work.

[1] 
https://lore.kernel.org/all/20240820163512.1096301-11-qyousef@layalina.io/

thanks,
Chenyu




^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-24  7:08     ` Chen, Yu C
@ 2025-12-24  8:19       ` K Prateek Nayak
  2025-12-24  9:46         ` Chen, Yu C
  0 siblings, 1 reply; 111+ messages in thread
From: K Prateek Nayak @ 2025-12-24  8:19 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Ingo Molnar, Adam Li, Aaron Lu, Tim Chen, linux-kernel,
	Vincent Guittot, Peter Zijlstra, Gautham R . Shenoy, Tim Chen

Hello Chenyu,

On 12/24/2025 12:38 PM, Chen, Yu C wrote:
> Hello Prateek,
> 
> On 12/23/2025 1:31 PM, K Prateek Nayak wrote:
>> Hello Tim, Chenyu,
>>
>> On 12/4/2025 4:37 AM, Tim Chen wrote:
>>> +/*
>>> + * Assign continuous llc id for the CPU, and return
>>> + * the assigned llc id.
>>> + */
>>> +static int update_llc_id(struct sched_domain *sd,
>>> +             int cpu)
>>> +{
>>> +    int id = per_cpu(sd_llc_id, cpu), i;
>>> +
>>> +    if (id >= 0)
>>> +        return id;
>>> +
>>> +    if (sd) {
>>> +        /* Look for any assigned id and reuse it.*/
>>> +        for_each_cpu(i, sched_domain_span(sd)) {
>>> +            id = per_cpu(sd_llc_id, i);
>>> +
>>> +            if (id >= 0) {
>>> +                per_cpu(sd_llc_id, cpu) = id;
>>> +                return id;
>>> +            }
>>> +        }
>>> +    }
>>
>> I don't really like tying this down to the sched_domain span since
>> partition and other weirdness can cause the max_llc count to go
>> unnecessarily high. The tl->mask() (from sched_domain_topology_level)
>> should give the mask considering all online CPUs and not bothering
>> about cpusets.
> 
> OK, using the topology_level's mask (tl's mask) should allow us to
>  skip the cpuset partition. I just wanted to check if your concern
> is about the excessive number of sd_llc_ids caused by the cpuset?

Yes. Basically all cases where sched_domain_span() isn't covering
the entire llc_span - even true for isolated partitions.

> 
> I was under the impression that without this patch, llc_ids are
> unique across different partitions.
> 
> For example, on vanilla kernel without cache_aware,
> suppose 1 LLC has CPU0,1,2,3. Before partition, all
> CPUs have the same llc_id 0. Then create a new partition,
> mkdir -p /sys/fs/cgroup/cgroup0
> echo "3" > /sys/fs/cgroup/cgroup0/cpuset.cpus
> echo root > /sys/fs/cgroup/cgroup0/cpuset.cpus.partition
> CPU0,1,2 share llc_id 0, and CPU3 has a dedicated llc_id 3.
> Do you suggest to let CPU3 reuse llc_id 0, so as to save
> more llc_id space?

Yes. And I think it is logical. Load balancing doesn't happen
across partitions so sd_llc_id reflecting the ID of the
physical LLC shouldn't be a problem.

> 
>>
>> How about something like:
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5b17d8e3cb55..c19b1c4e6472 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8270,6 +8270,18 @@ static void cpuset_cpu_active(void)
>>   static void cpuset_cpu_inactive(unsigned int cpu)
>>   {
>>       if (!cpuhp_tasks_frozen) {
>> +        /*
>> +         * This is necessary since offline CPUs are
>> +         * taken out of the tl->mask() and a newly
>> +         * onlined CPU in same LLC will not realize
>> +         * whether it should reuse the LLC ID owned
>> +         * by an offline CPU without knowing the
>> +         * LLC association.
>> +         *
>> +         * Safe to release the reference if this is
>> +         * the last CPU in the LLC going offline.
>> +         */
>> +        sched_domain_free_llc_id(cpu);
> 
> I'm OK with replacing the domain based cpumask by the topology_level
> mask, just wondering whether re-using the llc_id would increase
> the risk of race condition - it is possible that, a CPU has different
> llc_ids before/after online/offline. Can we assign/reserve a "static"
> llc_id for each CPU, whether it is online or offline? In this way,
> we don't need to worry about the data synchronization when using
> llc_id(). For example, I can think of adjusting the data in
> percpu nr_pref_llc[max_llcs] on every CPU whenever a CPU gets
> offline/online.

So I was thinking of expanding the rq->nr_pref_llc[] if the
max_llc increases, but leaving it as is if the number of LLCs
decreases. That way we don't have to worry about dereferencing
past the array boundary.

We can also have a wrapper like:

    struct nr_llc_stats {
        int		nr_llcs;
        struct rcu_head rcu;
        int 		*nr_pref_llc;
    }

And re-allocate and attach it in rq_attach_root() during sd
rebuild. That way, RCU read-side can always grab a reference to
it, enqueue / dequeue don't need to care since it cannot change
under rq_lock, and partition can use call_rcu() to free the old
ones up.
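
Roughly, that could look like the following (a minimal sketch, not
tested; the rq->llc_stats field and the helper names below are made
up for illustration):

	struct nr_llc_stats {
		int		nr_llcs;
		struct rcu_head	rcu;
		int		nr_pref_llc[];	/* nr_llcs entries */
	};

	/* Read side, e.g. while computing load-balance statistics. */
	static int nr_pref_llc_of(struct rq *rq, int llc)
	{
		struct nr_llc_stats *stats;
		int nr = 0;

		rcu_read_lock();
		stats = rcu_dereference(rq->llc_stats);
		if (stats && llc < stats->nr_llcs)
			nr = READ_ONCE(stats->nr_pref_llc[llc]);
		rcu_read_unlock();

		return nr;
	}

	/* Update side, from the rq_attach_root()/rebuild path. */
	static void attach_llc_stats(struct rq *rq, int nr_llcs)
	{
		struct nr_llc_stats *new, *old;

		new = kzalloc(struct_size(new, nr_pref_llc, nr_llcs), GFP_KERNEL);
		if (!new)
			return;

		new->nr_llcs = nr_llcs;
		old = rcu_replace_pointer(rq->llc_stats, new,
					  lockdep_is_held(&sched_domains_mutex));
		if (old)
			kfree_rcu(old, rcu);
	}

Note the sketch uses a flexible array member instead of a separate
pointer, so a single allocation covers both the header and the counters.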

> 
>>           cpuset_update_active_cpus();
>>       } else {
>>           num_cpus_frozen++;
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index 41caa22e0680..1378a1cfad18 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -631,6 +631,7 @@ void update_sched_domain_debugfs(void)
>>               i++;
>>           }
>>   +        debugfs_create_u32("llc_id", 0444, d_cpu, (u32 *)per_cpu_ptr(&sd_llc_id, cpu));
>>           __cpumask_clear_cpu(cpu, sd_sysctl_cpus);
>>       }
>>   }
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 3ceaa9dc9a9e..69fad88b57d8 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2142,6 +2142,7 @@ extern int group_balance_cpu(struct sched_group *sg);
>>     extern void update_sched_domain_debugfs(void);
>>   extern void dirty_sched_domain_sysctl(int cpu);
>> +void sched_domain_free_llc_id(int cpu);
>>     extern int sched_update_scaling(void);
>>   diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index cf643a5ddedd..d6e134767f30 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -20,6 +20,46 @@ void sched_domains_mutex_unlock(void)
>>   /* Protected by sched_domains_mutex: */
>>   static cpumask_var_t sched_domains_tmpmask;
>>   static cpumask_var_t sched_domains_tmpmask2;
>> +static cpumask_var_t sched_llc_id_alloc_mask;
>> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
>> +static int max_llcs = 0;
>> +
>> +static inline int sched_domain_alloc_llc_id(void)
>> +{
>> +    int llc_id;
>> +
>> +    lockdep_assert_held(&sched_domains_mutex);
>> +
>> +    llc_id = cpumask_first_zero(sched_llc_id_alloc_mask);
>> +    BUG_ON((unsigned int)llc_id >= nr_cpumask_bits);
>> +    cpumask_set_cpu(llc_id, sched_llc_id_alloc_mask);
>> +    ++max_llcs;
>> +
>> +    return llc_id;
>> +}
>> +
>> +void sched_domain_free_llc_id(int cpu)
>> +{
>> +    int i, llc_id = per_cpu(sd_llc_id, cpu);
>> +    bool found = false;
>> +
>> +    lockdep_assert_cpus_held(); /* For cpu_active_mask. */
>> +    guard(mutex)(&sched_domains_mutex);
>> +
>> +    per_cpu(sd_llc_id, cpu) = -1;
>> +    for_each_cpu(i, cpu_active_mask) {
>> +        if (per_cpu(sd_llc_id, i) == llc_id) {
>> +            found = true;
>> +            break;
>> +        }
>> +    }
>> +
>> +    /* Allow future hotplugs to claim this ID */
>> +    if (!found) {
>> +        cpumask_clear_cpu(llc_id, sched_llc_id_alloc_mask);
>> +        --max_llcs;
> 
> Maybe only allow increasing the value of max_llcs when a new LLC
> is detected. That says, max_llcs represents the total number of LLCs
> that have ever been detected, even if some of the corresponding
> CPUs have been taken offline via runtime hotplug. In this way, the
> data synchronization might be simpler, maybe trade additional memory
> space for code simplicity?

Ack. That "struct nr_llc_stats" might be over-engineering.
I don't mind working on it later after this goes in.
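
For the record, a rough sketch of that monotonic variant (adapting
sched_domain_alloc_llc_id() from my earlier diff; illustration only):

	static inline int sched_domain_alloc_llc_id(void)
	{
		int llc_id;

		lockdep_assert_held(&sched_domains_mutex);

		llc_id = cpumask_first_zero(sched_llc_id_alloc_mask);
		BUG_ON((unsigned int)llc_id >= nr_cpumask_bits);
		cpumask_set_cpu(llc_id, sched_llc_id_alloc_mask);

		/* Only grow: max_llcs covers every LLC id ever handed out. */
		if (llc_id >= max_llcs)
			max_llcs = llc_id + 1;

		return llc_id;
	}

sched_domain_free_llc_id() would still clear the bit so a later hotplug
can reuse the id, but it would leave max_llcs untouched, so arrays
sized by max_llcs never shrink underneath readers.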

> 
>> +    }
>> +}
>>     static int __init sched_debug_setup(char *str)
>>   {
>> @@ -658,7 +698,6 @@ static void destroy_sched_domains(struct sched_domain *sd)
>>    */
>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>   DEFINE_PER_CPU(int, sd_llc_size);
>> -DEFINE_PER_CPU(int, sd_llc_id);
>>   DEFINE_PER_CPU(int, sd_share_id);
>>   DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>   DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>> @@ -684,7 +723,6 @@ static void update_top_cache_domain(int cpu)
>>         rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>       per_cpu(sd_llc_size, cpu) = size;
>> -    per_cpu(sd_llc_id, cpu) = id;
>>       rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>         sd = lowest_flag_domain(cpu, SD_CLUSTER);
>> @@ -2567,10 +2605,35 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>         /* Set up domains for CPUs specified by the cpu_map: */
>>       for_each_cpu(i, cpu_map) {
>> -        struct sched_domain_topology_level *tl;
>> +        struct sched_domain_topology_level *tl, *tl_llc = NULL;
>> +        bool done = false;
>>             sd = NULL;
>>           for_each_sd_topology(tl) {
>> +            int flags = 0;
>> +
>> +            if (tl->sd_flags)
>> +                flags = (*tl->sd_flags)();
>> +
>> +            if (flags & SD_SHARE_LLC) {
>> +                tl_llc = tl;
>> +
>> +                /*
>> +                 * Entire cpu_map has been covered. We are
>> +                 * traversing only to find the highest
>> +                 * SD_SHARE_LLC level.
>> +                 */
>> +                if (done)
>> +                    continue;
>> +            }
>> +
>> +            /*
>> +             * Since SD_SHARE_LLC is SDF_SHARED_CHILD, we can
>> +             * safely break out if the entire cpu_map has been
>> +             * covered by a child domain.
>> +             */
>> +            if (done)
>> +                break;
>>                 sd = build_sched_domain(tl, cpu_map, attr, sd, i);
>>   @@ -2579,7 +2642,41 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>               if (tl == sched_domain_topology)
>>                   *per_cpu_ptr(d.sd, i) = sd;
>>               if (cpumask_equal(cpu_map, sched_domain_span(sd)))
>> -                break;
>> +                done = true;
>> +        }
>> +
>> +        /* First time visiting this CPU. Assign the llc_id. */
>> +        if (per_cpu(sd_llc_id, i) == -1) {
>> +            int j, llc_id = -1;
>> +
>> +            /*
>> +             * In case there are no SD_SHARE_LLC domains,
>> +             * each CPU gets its own llc_id. Find the first
>> +             * free bit on the mask and use it.
>> +             */
>> +            if (!tl_llc) {
>> +                per_cpu(sd_llc_id, i) = sched_domain_alloc_llc_id();
>> +                continue;
>> +            }
>> +
>> +            /*
>> +             * Visit all the CPUs of the LLC irrespective of the
>> +             * partition constraints and find if any of them have
>> +             * a valid llc_id.
>> +             */
>> +            for_each_cpu(j, tl_llc->mask(tl, i)) {
> 
> This is doable, we can use tl rather than domain's mask to
> share llc_id among partitions.
> 
>> +                llc_id = per_cpu(sd_llc_id, j);
>> +
>> +                /* Found a valid llc_id for CPU's LLC. */
>> +                if (llc_id != -1)
>> +                    break;
>> +            }
>> +
>> +            /* Valid llc_id not found. Allocate a new one. */
>> +            if (llc_id == -1)
>> +                llc_id = sched_domain_alloc_llc_id();
>> +
>> +            per_cpu(sd_llc_id, i) = llc_id;
>>           }
>>       }
>>   @@ -2759,6 +2856,7 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
>>         zalloc_cpumask_var(&sched_domains_tmpmask, GFP_KERNEL);
>>       zalloc_cpumask_var(&sched_domains_tmpmask2, GFP_KERNEL);
>> +    zalloc_cpumask_var(&sched_llc_id_alloc_mask, GFP_KERNEL);
>>       zalloc_cpumask_var(&fallback_doms, GFP_KERNEL);
>>         arch_update_cpu_topology();
>> ---
>>
>> AFAICT, "sd_llc_id" isn't compared across different partitions so having
>> the CPUs that are actually associated with same physical LLC but across
>> different partitions sharing the same "sd_llc_id" shouldn't be a problem.
>>
>> Thoughts?
>>
> 
> This means cpus_share_resources(int this_cpu, int that_cpu)
>  should be invoked when this_cpu and that_cpu belong to the same partition.
> In this way, we do not alter the context of cpus_share_resources(). We can
> conduct an audit of the places where cpus_share_resources() is used.

The only case I can think of is a task that wakes up after partitioning
and its wake CPU from a different partition is mistakenly taken to
share the LLC with the current CPU - but the task cannot actually
run on that old CPU and it'll have to take the
select_fallback_rq() path if prev_cpu was selected during
wake_affine().

I don't think it will be a common enough occurrence to cause an
issue, and even without that, wake_affine() could still pick the
prev_cpu if the current CPU is busy, or via wake_affine_weight().

-- 
Happy holidays!
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-24  8:19       ` K Prateek Nayak
@ 2025-12-24  9:46         ` Chen, Yu C
  2025-12-26  3:17           ` K Prateek Nayak
  0 siblings, 1 reply; 111+ messages in thread
From: Chen, Yu C @ 2025-12-24  9:46 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Ingo Molnar, Adam Li, Aaron Lu, Tim Chen, linux-kernel,
	Vincent Guittot, Peter Zijlstra, Gautham R . Shenoy, Tim Chen

On 12/24/2025 4:19 PM, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 12/24/2025 12:38 PM, Chen, Yu C wrote:
>> Hello Prateek,
>>
>> On 12/23/2025 1:31 PM, K Prateek Nayak wrote:
>>> Hello Tim, Chenyu,
>>>
>>> On 12/4/2025 4:37 AM, Tim Chen wrote:

[snip]

>> I'm OK with replacing the domain based cpumask by the topology_level
>> mask, just wondering whether re-using the llc_id would increase
>> the risk of race condition - it is possible that, a CPU has different
>> llc_ids before/after online/offline. Can we assign/reserve a "static"
>> llc_id for each CPU, whether it is online or offline? In this way,
>> we don't need to worry about the data synchronization when using
>> llc_id(). For example, I can think of adjusting the data in
>> percpu nr_pref_llc[max_llcs] on every CPU whenever a CPU gets
>> offline/online.
> 
> So I was thinking of expanding the rq->nr_pref_llc[] if the
> max_llc increases, but leaving it as is if the number of LLCs
> decreases. That way we don't have to worry about dereferencing
> past the array boundary.
> 

Sure, we can do it this way.

> We can also have a wrapper like:
> 
>      struct nr_llc_stats {
>          int		nr_llcs;
>          struct rcu_head rcu;
>          int 		*nr_pref_llc;
>      }
> 
> And re-allocate and attach it in rq_attach_root() during sd
> rebuild. That way, RCU read-side can always grab a reference to
> it, enqueue / dequeue don't need to care since it cannot change
> under rq_lock, and partition can use call_rcu() to free the old
> ones up.
> 

OK, we can go in this direction (Peter also suggested something like
this in the domain).

>>
>>>            cpuset_update_active_cpus();
>>>        } else {

[snip]

>>> AFAICT, "sd_llc_id" isn't compared across different partitions so having
>>> the CPUs that are actually associated with same physical LLC but across
>>> different partitions sharing the same "sd_llc_id" shouldn't be a problem.
>>>
>>> Thoughts?
>>>
>>
>> This means cpus_share_resources(int this_cpu, int that_cpu)

Actually I was about to say cpus_share_cache().

>>   should be invoked when this_cpu and that_cpu belong to the same partition.
>> In this way, we do not alter the context of cpus_share_resources(). We can
>> conduct an audit of the places where cpus_share_resources() is used.
> 
> The only case I can think of is a task that wakes up after partitioning
> and its wake CPU from a different partition is mistakenly taken to
> share the LLC with the current CPU - but the task cannot actually
> run on that old CPU and it'll have to take the
> select_fallback_rq() path if prev_cpu was selected during
> wake_affine().
> 

OK, that makes sense.
Actually, prev_cpu might not be chosen by select_task_rq_fair()->
select_idle_sibling(), because the fast path via select_idle_sibling()
is only expected to be triggered when prev_cpu and the current CPU are
in the same domain in select_task_rq_fair():

	cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
		sd = NULL; /* wake affine */

If the current CPU and prev_cpu are in different partitions, they
are not in the same domain.
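
For reference, the relevant part of the domain walk in
select_task_rq_fair() currently looks roughly like this (paraphrased
from mainline for context, not part of this series):

	for_each_domain(cpu, tmp) {
		/*
		 * If both 'cpu' and 'prev_cpu' are part of this domain,
		 * cpu is a valid SD_WAKE_AFFINE target.
		 */
		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
			if (cpu != prev_cpu)
				new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);

			sd = NULL; /* Prefer wake_affine over balance flags */
			break;
		}

		if (tmp->flags & sd_flag)
			sd = tmp;
		else if (!want_affine)
			break;
	}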

> I don't think it will be a common enough occurrence to cause an
> issue, and even without that, wake_affine() could still pick the
> prev_cpu if the current CPU is busy, or via wake_affine_weight().
> 

I realized that sched_cache has added cpus_share_cache() in
several places, most of which are related to load balancing;
that should not be a problem if the llc_id is shared among
partitions. I'll double-check.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes
  2025-12-19  4:01       ` Vern Hao
@ 2025-12-24 10:20         ` Chen, Yu C
  0 siblings, 0 replies; 111+ messages in thread
From: Chen, Yu C @ 2025-12-24 10:20 UTC (permalink / raw)
  To: Vern Hao
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Len Brown, Aubrey Li, Zhao Liu, Chen Yu,
	Adam Li, Aaron Lu, Tim Chen, linux-kernel, Tim Chen,
	Peter Zijlstra, K Prateek Nayak, Gautham R . Shenoy,
	Vincent Guittot, Ingo Molnar

On 12/19/2025 12:01 PM, Vern Hao wrote:
> 
> On 2025/12/16 03:32, Tim Chen wrote:
>> On Fri, 2025-12-12 at 11:34 +0800, Vern Hao wrote:
>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>> With cache-aware scheduling enabled, each task is assigned a
>>>> preferred LLC ID. This allows quick identification of the LLC domain
>>>> where the task prefers to run, similar to numa_preferred_nid in
>>>> NUMA balancing.
>>>>

[snip]

>>>> +
>>>> +    if (mm->mm_sched_cpu != -1) {
>>>> +        mm_sched_llc = llc_id(mm->mm_sched_cpu);
>>>> +
>>>> +#ifdef CONFIG_NUMA_BALANCING
>>>> +        /*
>>>> +         * Don't assign preferred LLC if it
>>>> +         * conflicts with NUMA balancing.
>>>> +         */
>>>> +        if (p->numa_preferred_nid >= 0 &&
>>> I wonder if the restriction here shouldn't be so strict. In Mel Gorman's
>>> patch (e496132ebedd sched/fair: Adjust the allowed NUMA imbalance when
>>> SD_NUMA spans multiple LLCs), the value of the 'imb_numa_nr' is checked
>>> to determine if SD_NUMA imbalance is allowed. Could we use this same
>>> check to decide whether or not to perform a cross-numa migration?
>> If we set a preferred LLC that's in a different node than the
>> preferred node, the preferred LLC is going to fight with NUMA
>> balancing and bounce tasks back and forth between nodes. NUMA
>> locality is going to affect performance more, so we'll let NUMA
>> preference take precedence.
> 
> I might not have explained myself clearly. I'm questioning whether we
> need to integrate an imbalance check into the scenario where
> 'sgs->group_type' is 'group_has_spare', like Mel's patch, to refine
> our llc migration decisions.
> 
> For example: 8 CPUs in one LLC, LLC-A has 6 tasks, LLC-B has 2
> tasks; if LLC-A has a task_a that needs to migrate to LLC-B, how
> should that be handled?
> 

If LLC_B is the preferred LLC of task_a, and if the average utilization of
LLC_B has not reached 50%, task_a will be moved to LLC_B. If LLC_A is the
preferred LLC of task_a, then if LLC_A has not reached 50%, task_a will
not be migrated to LLC_B. There are some comments around can_migrate_llc(),
which describe the decision matrix for migration.
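
As a rough illustration of that rule (not the actual can_migrate_llc();
llc_overload_pct is assumed to be the default 50 here and the helper
below is made up):

	static bool may_migrate_to_llc(int task_pref_llc, int src_llc,
				       int dst_llc, unsigned int src_util_pct,
				       unsigned int dst_util_pct)
	{
		/* Task prefers the destination LLC: aggregate it there
		 * as long as the destination is not overloaded. */
		if (task_pref_llc == dst_llc)
			return dst_util_pct < 50;

		/* Task prefers the source LLC: keep it there unless the
		 * source itself is overloaded. */
		if (task_pref_llc == src_llc)
			return src_util_pct >= 50;

		/* No LLC preference involved: defer to regular load balancing. */
		return true;
	}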

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling
  2025-12-24  7:51         ` Chen, Yu C
@ 2025-12-24 12:15           ` Yangyu Chen
  0 siblings, 0 replies; 111+ messages in thread
From: Yangyu Chen @ 2025-12-24 12:15 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Tim Chen, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Vincent Guittot, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Madadi Vineeth Reddy, Hillf Danton, Shrikanth Hegde, Jianyong Wu,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Adam Li, Aaron Lu, Tim Chen, linux-kernel, Qais Yousef



> On 24 Dec 2025, at 15:51, Chen, Yu C <yu.c.chen@intel.com> wrote:
> 
> On 12/24/2025 11:28 AM, Yangyu Chen wrote:
>>> On 24 Dec 2025, at 00:44, Yangyu Chen <cyy@cyyself.name> wrote:
>>> 
>>>> On 23 Dec 2025, at 20:12, Yangyu Chen <cyy@cyyself.name> wrote:
>>>> 
>>>>> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>>>> 
>>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>> 
>>>>> Introduce a set of debugfs knobs to control the enabling of
>>>>> and parameters for cache-aware load balancing.
>>>>> 
>>>>> (1) llc_enabled
>>>>> llc_enabled acts as the primary switch - users can toggle it to
>>>>> enable or disable cache aware load balancing.
>>>>> 
>>>>> (2) llc_aggr_tolerance
>>>>> With sched_cache enabled, the scheduler uses a process's RSS as a
>>>>> proxy for its LLC footprint to determine if aggregating tasks on the
>>>>> preferred LLC could cause cache contention. If RSS exceeds the LLC
>>>>> size, aggregation is skipped. Some workloads with large RSS but small
>>>>> actual memory footprints may still benefit from aggregation. Since
>>>>> the kernel cannot efficiently track per-task cache usage (resctrl is
>>>>> user-space only), userspace can provide a more accurate hint.
>>>>> 
>>>>> Introduce /sys/kernel/debug/sched/llc_aggr_tolerance to let
>>>>> users control how strictly RSS limits aggregation. Values range from
>>>>> 0 to 100:
>>>>> 
>>>>> - 0: Cache-aware scheduling is disabled.
>>>>> - 1: Strict; tasks with RSS larger than LLC size are skipped.
>>>>> - 100: Aggressive; tasks are aggregated regardless of RSS.
>>>>> 
>>>> 
>>>> Hi Chen Yu and Tim Chen,
>>>> 
>>>> Maybe we should have something like prctl(PR_LLC_AGGR_TOLERANCE, 100).
>>>> 
>>>> I have tested this version of the patch on my EPYC Milan 7V13 (7763 variant) physical machine, with 32M LLC for each 8-core CCX. I found that I need to tune "llc_aggr_tolerance" to 100, else I can't get cache-aware scheduling to work on Verilated [1] XiangShan [2] running the chacha20 [3] as I mentioned before [4].
>>>> 
>>> 
>>> In addition, I have investigated why this happens. And finally I
>>> realized that's because that workload observed 35596 kB RssAnon on
>>> my EPYC Milan Machine, slightly exceeding the LLC size (32M). I
>>> have tested it on an EPYC Genoa cloud server with the correct core
>>> / cache hierarchy in ACPI table, that shows 31700 kB RssAnon, thus
>>> fitting in LLC. I have no idea why my result shows higher RssAnon,
>>> since they both run Debian Trixie with the exact same kernel and
>>> same executable. But it reminds me we should have a userspace API
>>> for that.
>>> 
>> In addition, during profiling the verilator, I found that if scheduled
>> to SMTs, it will result in poor performance. Thus, I think we should
>> separate the control for rss size with the SMT scale.
> 
> Thanks for the investigation. Could you elaborate a little more about
> scheduled to SMTs? Do you mean, if every CPU(SMT) in the LLC has 1 running
> task, then the performance is impacted? I thought we have
> exceed_llc_nr() to check the smt to avoid this?

The number of threads Verilator uses for the RTL simulation is
specified at compile time; it cannot be changed at runtime since
Verilator does static partitioning. So I didn't mean the case where
another thread gets scheduled onto an SMT sibling in the LLC and we
get poor performance. I mean that users can allow Verilator to use
more threads than the LLC capacity. But I have tested your case: on
my observation with a recent XiangShan + Verilator + LLVM21 build and
an 8-thread emulator, it shows 41% (30% for 1 thread) and 62% (39%
for 1 thread) performance degradation on Raptor Lake and EPYC Milan
respectively if another 8 threads are running a simple loop. But I
think that's only one data point. Both Raptor Lake and Zen 5
statically partition the ROB in the CPU backend, and such workloads
suffer a lot of data cache misses since they have a very large
instruction footprint. I think SMT performance is not easy to
characterize across different microarchitectures and workloads, but
one thing is for sure: I didn't come across a situation where a
16-thread emulator on an EPYC machine scheduled to 1 CCX with 2 SMT
is better than 2 CCXs with only 1 SMT. That's why I think we should
split this into two user controls, one for RSS and one for the
number of threads.

Thanks,
Yangyu Chen

> 
>> It's notable that rss size is not the actual memory footprint. It
>> would be better if we could measure the l2_miss event or l3_miss
>> event to measure the l3 hit rate. Just for future work.
> 
> Yes, in user space, we can collect PMUs events/memory bandwidth via
> resctrl to decide whether to set task attributes.
> 
>> I'm willing to provide a patch for such a prctl. But I'm busy these
>> days, maybe I can have the time to do that after one week.
> 
> Sure. We haven't yet decided which interface we can leverage.
> Also,  Qais is working on QOS interface[1] - maybe we can build
> on his work.
> 
> [1] https://lore.kernel.org/all/20240820163512.1096301-11-qyousef@layalina.io/
> 
> thanks,
> Chenyu



^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH v2 04/23] sched/cache: Make LLC id continuous
  2025-12-24  9:46         ` Chen, Yu C
@ 2025-12-26  3:17           ` K Prateek Nayak
  0 siblings, 0 replies; 111+ messages in thread
From: K Prateek Nayak @ 2025-12-26  3:17 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
	Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
	Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Aubrey Li, Zhao Liu,
	Chen Yu, Ingo Molnar, Adam Li, Aaron Lu, Tim Chen, linux-kernel,
	Vincent Guittot, Peter Zijlstra, Gautham R . Shenoy, Tim Chen

Hello Chenyu,

On 12/24/2025 3:16 PM, Chen, Yu C wrote:
>> The only case I can think of is a task that wakes up after partitioning
>> and its wake CPU from a different partition is mistakenly taken to
>> share the LLC with the current CPU - but the task cannot actually
>> run on that old CPU and it'll have to take the
>> select_fallback_rq() path if prev_cpu was selected during
>> wake_affine().
>>
> 
> OK, make sense.
> Actually, prev_cpu might not be chosen by select_task_rq_fair()->
> select_idle_sibling(), because fast path select_idle_sibling()
>  is expected to be triggered when prev_cpu and the current cpu are in the
> same domain in select_task_rq_fair():
> cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))
>     sd = NULL; //wake affine

Then again, there are cases where want_affine is false, and
"new_cpu" is initialized to the prev_cpu and we continue down
to select_idle_sibling() since "tmp->flags & sd_flag" is
always false - WF_TTWU matches with SD_BALANCE_WAKE but no
domain sets it anymore afaict.

Again, going through the fallback selection path should be
rare (once after partition on wakeup) and shouldn't cause any
problems for most real-world scenarios.

> curr cpu and prev_cpu are in different partitions, they
> are not in the same domains.
> 
>> I don't think it will be a common enough occurrence to cause an
>> issue, and even without that, wake_affine() could still pick the
>> prev_cpu if the current CPU is busy, or via wake_affine_weight().
>>
> 
> I realized that sched_cache has added cpus_share_cache() in
> several places, most of which should be related to load
> balancing, which should not be a problem if llc_id is shared
> among partitions. I'll double check.

Thank you!

-- 
Happy Holidays!
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 111+ messages in thread

end of thread, other threads:[~2025-12-26  3:17 UTC | newest]

Thread overview: 111+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-03 23:07 [PATCH v2 00/23] Cache aware scheduling Tim Chen
2025-12-03 23:07 ` [PATCH v2 01/23] sched/cache: Introduce infrastructure for cache-aware load balancing Tim Chen
2025-12-09 11:12   ` Peter Zijlstra
2025-12-09 21:39     ` Tim Chen
2025-12-10  9:37   ` Peter Zijlstra
2025-12-10 13:57     ` Chen, Yu C
2025-12-10 15:11       ` Peter Zijlstra
2025-12-11  9:03   ` Vern Hao
2025-12-16  6:12     ` Chen, Yu C
2025-12-17  1:17       ` Vern Hao
     [not found]   ` <fbf52d91-0605-4608-b9cc-e8cc56115fd5@gmail.com>
2025-12-16 22:30     ` Tim Chen
2025-12-03 23:07 ` [PATCH v2 02/23] sched/cache: Record per-LLC utilization to guide cache-aware scheduling decisions Tim Chen
2025-12-09 11:21   ` Peter Zijlstra
2025-12-10 14:02     ` Chen, Yu C
2025-12-10 15:13       ` Peter Zijlstra
2025-12-10 23:58         ` Chen, Yu C
2025-12-03 23:07 ` [PATCH v2 03/23] sched/cache: Introduce helper functions to enforce LLC migration policy Tim Chen
2025-12-03 23:07 ` [PATCH v2 04/23] sched/cache: Make LLC id continuous Tim Chen
2025-12-09 11:58   ` Peter Zijlstra
2025-12-15 20:49     ` Tim Chen
2025-12-16  5:31       ` Chen, Yu C
2025-12-16 19:53         ` Tim Chen
2025-12-17  5:25           ` Chen, Yu C
2025-12-23  5:31   ` K Prateek Nayak
2025-12-24  7:08     ` Chen, Yu C
2025-12-24  8:19       ` K Prateek Nayak
2025-12-24  9:46         ` Chen, Yu C
2025-12-26  3:17           ` K Prateek Nayak
2025-12-03 23:07 ` [PATCH v2 05/23] sched/cache: Assign preferred LLC ID to processes Tim Chen
2025-12-09 12:11   ` Peter Zijlstra
2025-12-09 22:34     ` Tim Chen
2025-12-12  3:34   ` Vern Hao
2025-12-15 19:32     ` Tim Chen
2025-12-19  4:01       ` Vern Hao
2025-12-24 10:20         ` Chen, Yu C
2025-12-03 23:07 ` [PATCH v2 06/23] sched/cache: Track LLC-preferred tasks per runqueue Tim Chen
2025-12-09 12:16   ` Peter Zijlstra
2025-12-09 22:55     ` Tim Chen
2025-12-10  9:42       ` Peter Zijlstra
2025-12-16  0:20         ` Chen, Yu C
2025-12-17 10:04   ` Vern Hao
2025-12-17 12:37     ` Chen, Yu C
2025-12-03 23:07 ` [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC preference counter Tim Chen
2025-12-09 13:06   ` Peter Zijlstra
2025-12-09 23:17     ` Tim Chen
2025-12-10 12:43   ` Peter Zijlstra
2025-12-10 18:36     ` Tim Chen
2025-12-10 12:51   ` Peter Zijlstra
2025-12-10 18:49     ` Tim Chen
2025-12-11 10:31       ` Peter Zijlstra
2025-12-15 19:21         ` Tim Chen
2025-12-16 22:45         ` Tim Chen
2025-12-03 23:07 ` [PATCH v2 08/23] sched/cache: Calculate the per runqueue task LLC preference Tim Chen
2025-12-03 23:07 ` [PATCH v2 09/23] sched/cache: Count tasks prefering destination LLC in a sched group Tim Chen
2025-12-10 12:52   ` Peter Zijlstra
2025-12-10 14:05     ` Chen, Yu C
2025-12-10 15:16       ` Peter Zijlstra
2025-12-10 19:00         ` Tim Chen
2025-12-10 23:50         ` Chen, Yu C
2025-12-03 23:07 ` [PATCH v2 10/23] sched/cache: Check local_group only once in update_sg_lb_stats() Tim Chen
2025-12-03 23:07 ` [PATCH v2 11/23] sched/cache: Prioritize tasks preferring destination LLC during balancing Tim Chen
2025-12-03 23:07 ` [PATCH v2 12/23] sched/cache: Add migrate_llc_task migration type for cache-aware balancing Tim Chen
2025-12-10 13:32   ` Peter Zijlstra
2025-12-16  0:52     ` Chen, Yu C
2025-12-03 23:07 ` [PATCH v2 13/23] sched/cache: Handle moving single tasks to/from their preferred LLC Tim Chen
2025-12-03 23:07 ` [PATCH v2 14/23] sched/cache: Consider LLC preference when selecting tasks for load balancing Tim Chen
2025-12-10 15:58   ` Peter Zijlstra
2025-12-03 23:07 ` [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach Tim Chen
2025-12-10 16:30   ` Peter Zijlstra
2025-12-16  7:30     ` Chen, Yu C
2025-12-03 23:07 ` [PATCH v2 16/23] sched/cache: Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node Tim Chen
2025-12-10 16:32   ` Peter Zijlstra
2025-12-10 16:52     ` Peter Zijlstra
2025-12-16  7:36       ` Chen, Yu C
2025-12-16  7:31     ` Chen, Yu C
2025-12-03 23:07 ` [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling Tim Chen
2025-12-10 16:51   ` Peter Zijlstra
2025-12-16  7:40     ` Chen, Yu C
2025-12-17  9:40   ` Aaron Lu
2025-12-17 12:51     ` Chen, Yu C
2025-12-19  3:32       ` Aaron Lu
2025-12-03 23:07 ` [PATCH v2 18/23] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
2025-12-03 23:07 ` [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
2025-12-18  3:59   ` Vern Hao
2025-12-18  8:32     ` Chen, Yu C
2025-12-18  9:42       ` Vern Hao
2025-12-19  3:14         ` K Prateek Nayak
2025-12-19 12:55           ` Chen, Yu C
2025-12-22  2:49             ` Vern Hao
2025-12-22  2:19           ` Vern Hao
2025-12-03 23:07 ` [PATCH v2 20/23] sched/cache: Add user control to adjust the parameters of cache-aware scheduling Tim Chen
2025-12-10 17:02   ` Peter Zijlstra
2025-12-16  7:42     ` Chen, Yu C
2025-12-19  4:14   ` Vern Hao
2025-12-19 13:21     ` Chen, Yu C
2025-12-19 13:39     ` Chen, Yu C
2025-12-23 12:12   ` Yangyu Chen
2025-12-23 16:44     ` Yangyu Chen
2025-12-24  3:28       ` Yangyu Chen
2025-12-24  7:51         ` Chen, Yu C
2025-12-24 12:15           ` Yangyu Chen
2025-12-03 23:07 ` [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing Tim Chen
2025-12-19  5:03   ` Yangyu Chen
2025-12-19 14:41     ` Chen, Yu C
2025-12-19 14:48       ` Yangyu Chen
2025-12-03 23:07 ` [PATCH v2 22/23] -- DO NOT APPLY!!! -- sched/cache/debug: Add ftrace to track the load balance statistics Tim Chen
2025-12-03 23:07 ` [PATCH v2 23/23] -- DO NOT APPLY!!! -- sched/cache/debug: Display the per LLC occupancy for each process via proc fs Tim Chen
2025-12-17  9:59   ` Aaron Lu
2025-12-17 13:01     ` Chen, Yu C
2025-12-19  3:19 ` [PATCH v2 00/23] Cache aware scheduling Aaron Lu
2025-12-19 13:04   ` Chen, Yu C
