The Linux Kernel Mailing List
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>, Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>, Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>, Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 06/16] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
Date: Wed, 13 May 2026 13:39:17 -0700	[thread overview]
Message-ID: <1c62cc060ba2b33d7b1f0ed98b3390128edbae93.1778703694.git.tim.c.chen@linux.intel.com> (raw)
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>

From: Chen Yu <yu.c.chen@intel.com>

Introduce a set of debugfs knobs to control how aggressively
cache-aware scheduling aggregates tasks.

(1) aggr_tolerance
With sched_cache enabled, the scheduler uses a process's memory
footprint as a proxy for its LLC occupancy to decide whether
aggregating its tasks on the preferred LLC could cause cache
contention. If the footprint exceeds the LLC size, aggregation is
skipped. Since the kernel cannot efficiently track per-task cache
occupancy (resctrl only exposes it to user space), user space can
provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_balancing/aggr_tolerance to
let users control how strictly the footprint limits aggregation.
Meaningful values range from 0 to 100 (values above 100 behave like
100):
  - 0: cache-aware scheduling is disabled.
  - 1: strict; processes whose footprint exceeds the LLC size are
    skipped.
  - 100: aggressive; tasks are aggregated regardless of footprint.
For example, with a 32MB L3 cache:

  - aggr_tolerance=1 -> tasks with footprint > 32MB are skipped.
  - aggr_tolerance=99 -> tasks with footprint > 784GB are skipped
    (784GB = (1 + (99 - 1) * 256) * 32MB).
Similarly, /sys/kernel/debug/sched/llc_balancing/aggr_tolerance
controls how strictly the number of active threads is considered
during cache-aware load balancing. The SMT thread count is factored
in as well: high SMT counts reduce the aggregation capacity,
preventing excessive task aggregation on SMT-heavy systems such as
Power10/Power11.

Yangyu suggested introducing separate aggregation controls for the
number of active threads and memory footprint checks. Since there are
plans to add per-process/task group controls, fine-grained tunables are
deferred to that implementation.

(2) epoch_period, epoch_affinity_timeout, imb_pct and overaggr_pct
    are also exposed as tunables under the same directory.

Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: Jianyong Wu <jianyong.wu@outlook.com>
Suggested-by: Yangyu Chen <cyy@cyyself.name>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/debug.c | 10 +++++++
 kernel/sched/fair.c  | 68 ++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h |  5 ++++
 3 files changed, 75 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2eae67cd2ba2..fe569539e888 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -670,6 +670,16 @@ static __init int sched_init_debug(void)
 	llc = debugfs_create_dir("llc_balancing", debugfs_sched);
 	debugfs_create_file("enabled", 0644, llc, NULL,
 			    &sched_cache_enable_fops);
+	debugfs_create_u32("aggr_tolerance", 0644, llc,
+			   &llc_aggr_tolerance);
+	debugfs_create_u32("epoch_period", 0644, llc,
+			   &llc_epoch_period);
+	debugfs_create_u32("epoch_affinity_timeout", 0644, llc,
+			   &llc_epoch_affinity_timeout);
+	debugfs_create_u32("overaggr_pct", 0644, llc,
+			   &llc_overaggr_pct);
+	debugfs_create_u32("imb_pct", 0644, llc,
+			   &llc_imb_pct);
 #endif
 
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a10116ffe0d1..01ce646792ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1375,6 +1375,11 @@ static void set_next_buddy(struct sched_entity *se);
  */
 #define EPOCH_PERIOD	(HZ / 100)	/* 10 ms */
 #define EPOCH_LLC_AFFINITY_TIMEOUT	5	/* 50 ms */
+__read_mostly unsigned int llc_aggr_tolerance	= 1;
+__read_mostly unsigned int llc_epoch_period	= EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
+__read_mostly unsigned int llc_imb_pct		= 20;
+__read_mostly unsigned int llc_overaggr_pct	= 50;
 
 static int llc_id(int cpu)
 {
@@ -1384,11 +1389,25 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static inline int get_sched_cache_scale(int mul)
+{
+	unsigned int tol = READ_ONCE(llc_aggr_tolerance);
+
+	if (!tol)
+		return 0;
+
+	if (tol >= 100)
+		return INT_MAX;
+
+	return (1 + (tol - 1) * mul);
+}
+
 static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned long llc, footprint;
 	struct sched_domain *sd;
+	int scale;
 
 	guard(rcu)();
 
@@ -1404,7 +1423,28 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 		llc = sd->llc_bytes;
 		footprint = READ_ONCE(mm->sc_stat.footprint);
 
-		return (llc < (footprint * PAGE_SIZE));
+		/*
+		 * Scale the LLC size by 256*llc_aggr_tolerance
+		 * and compare it to the task's footprint.
+		 *
+		 * Suppose the L3 size is 32MB. If the
+		 * llc_aggr_tolerance is 1:
+		 * When the footprint is larger than 32MB, the
+		 * process is regarded as exceeding the LLC
+		 * capacity. If the llc_aggr_tolerance is 99:
+		 * When the footprint is larger than 784GB, the
+		 * process is regarded as exceeding the LLC
+		 * capacity:
+		 * 784GB = (1 + (99 - 1) * 256) * 32MB
+		 * If the llc_aggr_tolerance is 100:
+		 * ignore the footprint and do the aggregation
+		 * anyway.
+		 */
+		scale = get_sched_cache_scale(256);
+		if (scale == INT_MAX)
+			return false;
+
+		return ((llc * (u64)scale) < (footprint * PAGE_SIZE));
 	}
 #endif
 	return false;
@@ -1413,11 +1453,21 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
 static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
 			   int cpu)
 {
+	int scale;
+
 	if (get_nr_threads(p) <= 1)
 		return true;
 
+	/*
+	 * Scale the number of 'cores' in a LLC by llc_aggr_tolerance
+	 * and compare it to the task's active threads.
+	 */
+	scale = get_sched_cache_scale(1);
+	if (scale == INT_MAX)
+		return false;
+
 	return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
-			per_cpu(sd_llc_size, cpu));
+			(scale * per_cpu(sd_llc_size, cpu)));
 }
 
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1513,13 +1563,14 @@ static inline void __update_mm_sched(struct rq *rq,
 {
 	lockdep_assert_held(&rq->cpu_epoch_lock);
 
+	unsigned int period = max(READ_ONCE(llc_epoch_period), 1U);
 	unsigned long n, now = jiffies;
 	long delta = now - rq->cpu_epoch_next;
 
 	if (delta > 0) {
-		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		n = (delta + period - 1) / period;
 		rq->cpu_epoch += n;
-		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		rq->cpu_epoch_next += n * period;
 		__shr_u64(&rq->cpu_runtime, n);
 	}
 
@@ -1611,7 +1662,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * If this process hasn't hit task_cache_work() for a while invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout ||
 	    invalid_llc_nr(mm, p, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
@@ -1740,7 +1791,8 @@ static void task_cache_work(struct callback_head *work)
 
 	/* only 1 thread is allowed to scan */
 	if (!try_cmpxchg(&mm->sc_stat.next_scan, &next_scan,
-			 now + EPOCH_PERIOD))
+			 now + max_t(unsigned long,
+				     READ_ONCE(llc_epoch_period), 1)))
 		return;
 
 	curr_cpu = task_cpu(p);
@@ -10232,7 +10284,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
  */
 static bool fits_llc_capacity(unsigned long util, unsigned long max)
 {
-	u32 aggr_pct = 50;
+	u32 aggr_pct = llc_overaggr_pct;
 
 	/*
 	 * For single core systems, raise the aggregation
@@ -10252,7 +10304,7 @@ static bool fits_llc_capacity(unsigned long util, unsigned long max)
  */
 /* Allows dst util to be bigger than src util by up to bias percent */
 #define util_greater(util1, util2) \
-	((util1) * 100 > (util2) * 120)
+	((util1) * 100 > (util2) * (100 + llc_imb_pct))
 
 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
 					 unsigned long *cap)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f499d5dd1130..27409399137c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4072,6 +4072,11 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
 DECLARE_STATIC_KEY_FALSE(sched_cache_present);
 DECLARE_STATIC_KEY_FALSE(sched_cache_active);
 extern int sysctl_sched_cache_user;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_imb_pct;
+extern unsigned int llc_overaggr_pct;
 
 static inline bool sched_cache_enabled(void)
 {
-- 
2.32.0



Thread overview: 17+ messages
2026-05-13 20:39 [Patch v4 00/16] Cache aware scheduling enhancements Tim Chen
2026-05-13 20:39 ` [Patch v4 01/16] sched/cache: Allow only 1 thread of the process to calculate the LLC occupancy Tim Chen
2026-05-13 20:39 ` [Patch v4 02/16] sched/cache: Disable cache aware scheduling for processes with high thread counts Tim Chen
2026-05-13 20:39 ` [Patch v4 03/16] sched/cache: Skip cache-aware scheduling for single-threaded processes Tim Chen
2026-05-13 20:39 ` [Patch v4 04/16] sched/cache: Calculate the LLC size and store it in sched_domain Tim Chen
2026-05-13 20:39 ` [Patch v4 05/16] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
2026-05-13 20:39 ` Tim Chen [this message]
2026-05-13 20:39 ` [Patch v4 07/16] sched/cache: Fix rcu warning when accessing sd_llc domain Tim Chen
2026-05-13 20:39 ` [Patch v4 08/16] sched/cache: Fix potential NULL mm pointer access Tim Chen
2026-05-13 20:39 ` [Patch v4 09/16] sched/cache: Annotate lockless accesses to mm->sc_stat.cpu Tim Chen
2026-05-13 20:39 ` [Patch v4 10/16] sched/cache: Fix unpaired account_llc_enqueue/dequeue Tim Chen
2026-05-13 20:39 ` [Patch v4 11/16] sched/cache: Fix checking active load balance by only considering the CFS task Tim Chen
2026-05-13 20:39 ` [Patch v4 12/16] sched/cache: Fix race condition during sched domain rebuild Tim Chen
2026-05-13 20:39 ` [Patch v4 13/16] sched/cache: Fix cache aware scheduling enabling for multi LLCs system Tim Chen
2026-05-13 20:39 ` [Patch v4 14/16] sched/cache: Fix has_multi_llcs iff at least one partition has multiple LLCs Tim Chen
2026-05-13 20:39 ` [Patch v4 15/16] sched/cache: Fix possible overflow when invalidating the preferred CPU Tim Chen
2026-05-13 20:39 ` [Patch v4 16/16] sched/cache: Fix stale preferred_llc for a new task Tim Chen
