The Linux Kernel Mailing List
From: Tim Chen <tim.c.chen@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Chen Yu <yu.c.chen@intel.com>, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Madadi Vineeth Reddy <vineethr@linux.ibm.com>,
	Hillf Danton <hdanton@sina.com>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Jianyong Wu <jianyong.wu@outlook.com>,
	Yangyu Chen <cyy@cyyself.name>,
	Tingyin Duan <tingyin.duan@gmail.com>,
	Vern Hao <vernhao@tencent.com>, Vern Hao <haoxing990@gmail.com>,
	Len Brown <len.brown@intel.com>,
	Tim Chen <tim.c.chen@linux.intel.com>,
	Aubrey Li <aubrey.li@intel.com>, Zhao Liu <zhao1.liu@intel.com>,
	Chen Yu <yu.chen.surf@gmail.com>,
	Adam Li <adamli@os.amperecomputing.com>,
	Aaron Lu <ziqianlu@bytedance.com>,
	Tim Chen <tim.c.chen@intel.com>, Josh Don <joshdon@google.com>,
	Gavin Guo <gavinguo@igalia.com>,
	Qais Yousef <qyousef@layalina.io>,
	Libo Chen <libchen@purestorage.com>,
	Luo Gengkun <luogengkun2@huawei.com>,
	linux-kernel@vger.kernel.org
Subject: [Patch v4 02/16] sched/cache: Disable cache aware scheduling for processes with high thread counts
Date: Wed, 13 May 2026 13:39:13 -0700	[thread overview]
Message-ID: <d076cd21a8e6c6341d1e2d927e118db770ebb650.1778703694.git.tim.c.chen@linux.intel.com> (raw)
In-Reply-To: <cover.1778703694.git.tim.c.chen@linux.intel.com>

From: Chen Yu <yu.c.chen@intel.com>

Prateek observed a performance regression when running hackbench with
many threads per process (a high fd count). To avoid this, exclude
processes with a large number of active threads from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the task currently running on a CPU belongs to the same process
as the task that launched task_cache_work(), increment the active-thread
count.
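The sampled count is not used raw: the diff below smooths it into
mm->sc_stat.nr_running_avg with an EWMA whose divisor scales with the
LLC size (update_avg_scale()). A standalone userspace model of that
arithmetic, for illustration only (the kernel version uses clamp_t()
and div64_s64() instead of the plain C below):

```c
#include <stdint.h>

/* Model of the divisor scaling: one quarter of the LLC's CPU count,
 * clamped to [2, 8]. Small LLCs react faster; large LLCs fall back to
 * the default EWMA smoothing factor of 8 used by update_avg(). */
static uint32_t ewma_divisor(int llc_size)
{
	int d = llc_size >> 2;
	if (d < 2)
		d = 2;
	if (d > 8)
		d = 8;
	return (uint32_t)d;
}

/* Model of update_avg_scale(): avg += (sample - avg) / divisor,
 * with truncating signed division as in div64_s64(). */
static void update_avg_scale(int64_t *avg, int64_t sample, int llc_size)
{
	*avg += (sample - *avg) / (int64_t)ewma_divisor(llc_size);
}
```

For example, with an LLC of 4 CPUs the divisor is 2, so a sample of 8
threads moves a zero average to 4 in one step; with 64 CPUs per LLC the
divisor is 8 and the same sample only moves it to 1.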

If the number of active threads in the process exceeds the number of
cores in the LLC (that is, the number of CPUs divided by the number of
SMT siblings), do not enable cache-aware scheduling. However, on systems
with few CPUs per LLC, such as Power10/Power11 with SMT4 and an LLC size
of 4, this check effectively disables cache-aware scheduling for every
process. One possible solution, suggested by Peter, is to track a
preferred LLC mask instead of a single preferred LLC; once a process can
prefer a 'few' LLCs, this constraint becomes easier to satisfy. That
could be a future enhancement.
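The check itself is invalid_llc_nr() in the diff below, built on
fits_capacity() from kernel/sched/fair.c, which leaves a ~20% headroom
margin (cap fits in max when cap * 1280 < max * 1024). A userspace
model of the combined condition, for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of fits_capacity(): cap fits when it stays below ~80% of max. */
static bool fits_capacity(uint64_t cap, uint64_t max)
{
	return cap * 1280 < max * 1024;
}

/* Model of invalid_llc_nr(): cache-aware scheduling is skipped when
 * the smoothed active-thread count, scaled by the number of SMT
 * siblings per core, no longer fits in the LLC's CPU count. */
static bool invalid_llc_nr(uint64_t nr_running_avg, int smt_threads,
			   int llc_size)
{
	return !fits_capacity(nr_running_avg * smt_threads,
			      (uint64_t)llc_size);
}
```

With 32 CPUs and SMT2 per LLC, up to 12 smoothed active threads still
fit; with SMT4 and an LLC size of 4 even a single thread fails the
check, which is the Power10/Power11 case described above.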

For users who wish to perform task aggregation regardless, a subsequent
change adds a debugfs knob for tuning.

Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 48 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d883f109ba3..6701911eaaf7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2423,6 +2423,7 @@ struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
 	raw_spinlock_t lock;
 	unsigned long epoch;
+	u64 nr_running_avg;
 	unsigned long next_scan;
 	int cpu;
 } ____cacheline_aligned_in_smp;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a759ea669d74..808f614fc2d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,6 +1384,12 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static bool invalid_llc_nr(struct mm_struct *mm, int cpu)
+{
+	return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
+			per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct sched_domain *sd;
@@ -1452,7 +1458,7 @@ void mm_init_sched(struct mm_struct *mm,
 	mm->sc_stat.epoch = epoch;
 	mm->sc_stat.cpu = -1;
 	mm->sc_stat.next_scan = jiffies;
-
+	mm->sc_stat.nr_running_avg = 0;
 	/*
 	 * The update to mm->sc_stat should not be reordered
 	 * before initialization to mm's other fields, in case
@@ -1574,7 +1580,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * If this process hasn't hit task_cache_work() for a while invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT) {
+	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	    invalid_llc_nr(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1660,14 +1667,32 @@ static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
 	cpumask_copy(cpus, cpu_online_mask);
 }
 
+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+	int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
+	s64 diff = sample - *avg;
+	u32 divisor;
+
+	/*
+	 * Scale the divisor based on the number of CPUs contained
+	 * in the LLC. This scaling ensures smaller LLC domains use
+	 * a smaller divisor to achieve more precise sensitivity to
+	 * changes in nr_running, while larger LLC domains are capped
+	 * at a maximum divisor of 8 which is the default smoothing
+	 * factor of EWMA in update_avg().
+	 */
+	divisor = clamp_t(u32, (factor >> 2), 2, 8);
+	*avg += div64_s64(diff, divisor);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
 	unsigned long next_scan, now = jiffies;
-	struct task_struct *p = current;
+	struct task_struct *p = current, *cur;
+	int cpu, m_a_cpu = -1, nr_running = 0;
+	unsigned long curr_m_a_occ = 0;
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
-	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1711,6 +1736,11 @@ static void task_cache_work(struct callback_head *work)
 					m_occ = occ;
 					m_cpu = i;
 				}
+
+				cur = rcu_dereference_all(cpu_rq(i)->curr);
+				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+				    cur->mm == mm)
+					nr_running++;
 			}
 
 			/*
@@ -1754,6 +1784,7 @@ static void task_cache_work(struct callback_head *work)
 		mm->sc_stat.cpu = m_a_cpu;
 	}
 
+	update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
 	free_cpumask_var(cpus);
 }
 
@@ -10294,6 +10325,13 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
+	/* skip cache aware load balance for too many threads */
+	if (invalid_llc_nr(mm, dst_cpu)) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+		return mig_unrestricted;
+	}
+
 	if (cpus_share_cache(dst_cpu, cpu))
 		to_pref = true;
 	else if (cpus_share_cache(src_cpu, cpu))
-- 
2.32.0



Thread overview: 17+ messages
2026-05-13 20:39 [Patch v4 00/16] Cache aware scheduling enhancements Tim Chen
2026-05-13 20:39 ` [Patch v4 01/16] sched/cache: Allow only 1 thread of the process to calculate the LLC occupancy Tim Chen
2026-05-13 20:39 ` Tim Chen [this message]
2026-05-13 20:39 ` [Patch v4 03/16] sched/cache: Skip cache-aware scheduling for single-threaded processes Tim Chen
2026-05-13 20:39 ` [Patch v4 04/16] sched/cache: Calculate the LLC size and store it in sched_domain Tim Chen
2026-05-13 20:39 ` [Patch v4 05/16] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Tim Chen
2026-05-13 20:39 ` [Patch v4 06/16] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling Tim Chen
2026-05-13 20:39 ` [Patch v4 07/16] sched/cache: Fix rcu warning when accessing sd_llc domain Tim Chen
2026-05-13 20:39 ` [Patch v4 08/16] sched/cache: Fix potential NULL mm pointer access Tim Chen
2026-05-13 20:39 ` [Patch v4 09/16] sched/cache: Annotate lockless accesses to mm->sc_stat.cpu Tim Chen
2026-05-13 20:39 ` [Patch v4 10/16] sched/cache: Fix unpaired account_llc_enqueue/dequeue Tim Chen
2026-05-13 20:39 ` [Patch v4 11/16] sched/cache: Fix checking active load balance by only considering the CFS task Tim Chen
2026-05-13 20:39 ` [Patch v4 12/16] sched/cache: Fix race condition during sched domain rebuild Tim Chen
2026-05-13 20:39 ` [Patch v4 13/16] sched/cache: Fix cache aware scheduling enabling for multi LLCs system Tim Chen
2026-05-13 20:39 ` [Patch v4 14/16] sched/cache: Fix has_multi_llcs iff at least one partition has multiple LLCs Tim Chen
2026-05-13 20:39 ` [Patch v4 15/16] sched/cache: Fix possible overflow when invalidating the preferred CPU Tim Chen
2026-05-13 20:39 ` [Patch v4 16/16] sched/cache: Fix stale preferred_llc for a new task Tim Chen
