public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/5] sched: Introduce Cache aware scheduling
@ 2025-04-21  3:23 Chen Yu
  2025-04-21  3:24 ` [RFC PATCH 1/5] sched: Cache aware load-balancing Chen Yu
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Chen Yu @ 2025-04-21  3:23 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton,
	linux-kernel, Chen Yu

This is a respin of the cache-aware scheduling proposed by Peter [1].
In this patch set, some known issues in [1] are addressed, and the
performance regressions are investigated and mitigated.

Cache-aware scheduling aims to aggregate tasks that potentially share resources
into the same cache domain. This approach enhances cache locality, thereby optimizing
system performance by reducing cache misses and improving data access efficiency.

In the current implementation, threads within the same process are treated as
entities that potentially share resources. Cache-aware scheduling monitors the CPU
occupancy of each cache domain for every process. Based on this monitoring, it tries
to migrate the threads of a given process to that process's cache-hot domain, with
the goal of maximizing cache locality.

Patch 1 implements the basic cache-aware scheduling. It is the same patch as [1].
Patch 2 comprises a series of fixes for Patch 1, including compile warnings and
functional fixes.
Patch 3 fixes the performance degradation that arises from excessive task migrations
within the preferred LLC domain.
Patch 4 further alleviates the performance regressions seen when the preferred LLC
becomes saturated.
Patch 5 introduces ftrace events, which are used to track task migrations triggered by
the wakeup path and the load balancer. This addition facilitates performance regression
analysis.

The patch set is applied on top of v6.14 sched/core,
commit 4ba7518327c6 ("sched/debug: Print the local group's asym_prefer_cpu")

schbench was tested on EMR and Zen3 Milan. An improvement in tail latency was observed
when the LLC was underloaded; however, some regressions were still evident when the LLC
was saturated. Additionally, the load balancing logic should be adjusted to further
address these regressions.

[1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/


Chen Yu (4):
  sched: Several fixes for cache aware scheduling
  sched: Avoid task migration within its preferred LLC
  sched: Inhibit cache aware scheduling if the preferred LLC is over
    aggregated
  sched: Add ftrace to track task migration and load balance within and
    across LLC

Peter Zijlstra (1):
  sched: Cache aware load-balancing

 include/linux/mm_types.h       |  44 ++++
 include/linux/sched.h          |   4 +
 include/linux/sched/topology.h |   4 +
 include/trace/events/sched.h   |  51 ++++
 init/Kconfig                   |   4 +
 kernel/fork.c                  |   5 +
 kernel/sched/core.c            |  13 +-
 kernel/sched/fair.c            | 461 +++++++++++++++++++++++++++++++--
 kernel/sched/features.h        |   1 +
 kernel/sched/sched.h           |   8 +
 10 files changed, 569 insertions(+), 26 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH 1/5] sched: Cache aware load-balancing
  2025-04-21  3:23 [RFC PATCH 0/5] sched: Introduce Cache aware scheduling Chen Yu
@ 2025-04-21  3:24 ` Chen Yu
  2025-04-21  3:24 ` [RFC PATCH 2/5] sched: Several fixes for cache aware scheduling Chen Yu
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Chen Yu @ 2025-04-21  3:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton,
	linux-kernel

From: Peter Zijlstra <peterz@infradead.org>

Hi all,

One of the many things on the eternal todo list has been finishing the
below hackery.

It is an attempt at modelling cache affinity -- and while the patch
really only targets LLC, it could very well be extended to also apply to
clusters (L2). Specifically any case of multiple cache domains inside a
node.

Anyway, I wrote this about a year ago, and I mentioned this at the
recent OSPM conf where Gautham and Prateek expressed interest in playing
with this code.

So here goes, very rough and largely unproven code ahead :-)

It applies to current tip/master, but I know it will fail the __percpu
validation that sits in -next, although that shouldn't be terribly hard
to fix up.

As is, it only computes a CPU inside the LLC that has the highest recent
runtime, this CPU is then used in the wake-up path to steer towards this
LLC and in task_hot() to limit migrations away from it.

More elaborate things could be done, notably there is an XXX in there
somewhere about finding the best LLC inside a NODE (interaction with
NUMA_BALANCING).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/mm_types.h |  44 ++++++
 include/linux/sched.h    |   4 +
 init/Kconfig             |   4 +
 kernel/fork.c            |   5 +
 kernel/sched/core.c      |  13 +-
 kernel/sched/fair.c      | 330 +++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h     |   8 +
 7 files changed, 388 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 56d07edd01f9..013291c6aaa2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -893,6 +893,12 @@ struct mm_cid {
 };
 #endif
 
+struct mm_sched {
+	u64 runtime;
+	unsigned long epoch;
+	unsigned long occ;
+};
+
 struct kioctx_table;
 struct iommu_mm_data;
 struct mm_struct {
@@ -983,6 +989,17 @@ struct mm_struct {
 		 */
 		raw_spinlock_t cpus_allowed_lock;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		/*
+		 * Track per-cpu-per-process occupancy as a proxy for cache residency.
+		 * See account_mm_sched() and ...
+		 */
+		struct mm_sched __percpu *pcpu_sched;
+		raw_spinlock_t mm_sched_lock;
+		unsigned long mm_sched_epoch;
+		int mm_sched_cpu;
+#endif
+
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
 #endif
@@ -1393,6 +1410,33 @@ static inline unsigned int mm_cid_size(void)
 static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
 #endif /* CONFIG_SCHED_MM_CID */
 
+#ifdef CONFIG_SCHED_CACHE
+extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sched);
+
+static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
+{
+	struct mm_sched *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+	if (!pcpu_sched)
+		return -ENOMEM;
+
+	mm_init_sched(mm, pcpu_sched);
+	return 0;
+}
+
+#define mm_alloc_sched(...)	alloc_hooks(mm_alloc_sched_noprof(__VA_ARGS__))
+
+static inline void mm_destroy_sched(struct mm_struct *mm)
+{
+	free_percpu(mm->pcpu_sched);
+	mm->pcpu_sched = NULL;
+}
+#else /* !CONFIG_SCHED_CACHE */
+
+static inline int mm_alloc_sched(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_sched(struct mm_struct *mm) { }
+
+#endif /* CONFIG_SCHED_CACHE */
+
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..d0e4cda2b3cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1399,6 +1399,10 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_CACHE
+	struct callback_head		cache_work;
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_len;
diff --git a/init/Kconfig b/init/Kconfig
index b2c045c71d7f..7e0104efd138 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -950,6 +950,10 @@ config NUMA_BALANCING
 
 	  This system will be inactive on UMA systems.
 
+config SCHED_CACHE
+	bool "Cache aware scheduler"
+	default y
+
 config NUMA_BALANCING_DEFAULT_ENABLED
 	bool "Automatically enable NUMA aware memory/task placement"
 	default y
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b..974869841e62 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1331,6 +1331,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
+	if (mm_alloc_sched(mm))
+		goto fail_sched;
+
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
 				     NR_MM_COUNTERS))
 		goto fail_pcpu;
@@ -1340,6 +1343,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	return mm;
 
 fail_pcpu:
+	mm_destroy_sched(mm);
+fail_sched:
 	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 79692f85643f..5a92c02df97b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4524,6 +4524,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	init_sched_mm(p);
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -8528,6 +8529,7 @@ static struct kmem_cache *task_group_cache __ro_after_init;
 
 void __init sched_init(void)
 {
+	unsigned long now = jiffies;
 	unsigned long ptr = 0;
 	int i;
 
@@ -8602,7 +8604,7 @@ void __init sched_init(void)
 		raw_spin_lock_init(&rq->__lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
-		rq->calc_load_update = jiffies + LOAD_FREQ;
+		rq->calc_load_update = now + LOAD_FREQ;
 		init_cfs_rq(&rq->cfs);
 		init_rt_rq(&rq->rt);
 		init_dl_rq(&rq->dl);
@@ -8646,7 +8648,7 @@ void __init sched_init(void)
 		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
 		rq->balance_callback = &balance_push_callback;
 		rq->active_balance = 0;
-		rq->next_balance = jiffies;
+		rq->next_balance = now;
 		rq->push_cpu = 0;
 		rq->cpu = i;
 		rq->online = 0;
@@ -8658,7 +8660,7 @@ void __init sched_init(void)
 
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
-		rq->last_blocked_load_update_tick = jiffies;
+		rq->last_blocked_load_update_tick = now;
 		atomic_set(&rq->nohz_flags, 0);
 
 		INIT_CSD(&rq->nohz_csd, nohz_csd_func, rq);
@@ -8683,6 +8685,11 @@ void __init sched_init(void)
 
 		rq->core_cookie = 0UL;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+		raw_spin_lock_init(&rq->cpu_epoch_lock);
+		rq->cpu_epoch_next = now;
+#endif
+
 		zalloc_cpumask_var_node(&rq->scratch_mask, GFP_KERNEL, cpu_to_node(i));
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5e1bd9e8464c..23ea35dbd381 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1166,10 +1166,229 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 	return delta_exec;
 }
 
-static inline void update_curr_task(struct task_struct *p, s64 delta_exec)
+#ifdef CONFIG_SCHED_CACHE
+
+/*
+ * XXX numbers come from a place the sun don't shine -- probably wants to be SD
+ * tunable or so.
+ */
+#define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
+#define EPOCH_OLD	5		/* 50 ms */
+
+void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
+{
+	unsigned long epoch;
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct mm_sched *pcpu_sched = per_cpu_ptr(_pcpu_sched, i);
+		struct rq *rq = cpu_rq(i);
+
+		pcpu_sched->runtime = 0;
+		pcpu_sched->epoch = epoch = rq->cpu_epoch;
+		pcpu_sched->occ = -1;
+	}
+
+	raw_spin_lock_init(&mm->mm_sched_lock);
+	mm->mm_sched_epoch = epoch;
+	mm->mm_sched_cpu = -1;
+
+	smp_store_release(&mm->pcpu_sched, _pcpu_sched);
+}
+
+/* because why would C be fully specified */
+static __always_inline void __shr_u64(u64 *val, unsigned int n)
+{
+	if (n >= 64) {
+		*val = 0;
+		return;
+	}
+	*val >>= n;
+}
+
+static inline void __update_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	lockdep_assert_held(&rq->cpu_epoch_lock);
+
+	unsigned long n, now = jiffies;
+	long delta = now - rq->cpu_epoch_next;
+
+	if (delta > 0) {
+		n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+		rq->cpu_epoch += n;
+		rq->cpu_epoch_next += n * EPOCH_PERIOD;
+		__shr_u64(&rq->cpu_runtime, n);
+	}
+
+	n = rq->cpu_epoch - pcpu_sched->epoch;
+	if (n) {
+		pcpu_sched->epoch += n;
+		__shr_u64(&pcpu_sched->runtime, n);
+	}
+}
+
+static unsigned long fraction_mm_sched(struct rq *rq, struct mm_sched *pcpu_sched)
+{
+	guard(raw_spinlock_irqsave)(&rq->cpu_epoch_lock);
+
+	__update_mm_sched(rq, pcpu_sched);
+
+	/*
+	 * Runtime is a geometric series (r=0.5) and as such will sum to twice
+	 * the accumulation period, this means the multiplcation here should
+	 * not overflow.
+	 */
+	return div64_u64(NICE_0_LOAD * pcpu_sched->runtime, rq->cpu_runtime + 1);
+}
+
+static inline
+void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
+{
+	struct mm_struct *mm = p->mm;
+	struct mm_sched *pcpu_sched;
+	unsigned long epoch;
+
+	/*
+	 * init_task and kthreads don't be having no mm
+	 */
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
+
+	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
+		__update_mm_sched(rq, pcpu_sched);
+		pcpu_sched->runtime += delta_exec;
+		rq->cpu_runtime += delta_exec;
+		epoch = rq->cpu_epoch;
+	}
+
+	/*
+	 * If this task hasn't hit task_cache_work() for a while, invalidate
+	 * its preferred state.
+	 */
+	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_OLD) {
+		mm->mm_sched_cpu = -1;
+		pcpu_sched->occ = -1;
+	}
+}
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	struct mm_struct *mm = p->mm;
+
+	if (!mm || !mm->pcpu_sched)
+		return;
+
+	if (mm->mm_sched_epoch == rq->cpu_epoch)
+		return;
+
+	guard(raw_spinlock)(&mm->mm_sched_lock);
+
+	if (mm->mm_sched_epoch == rq->cpu_epoch)
+		return;
+
+	if (work->next == work) {
+		task_work_add(p, work, TWA_RESUME);
+		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
+	}
+}
+
+static void task_cache_work(struct callback_head *work)
+{
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	unsigned long m_a_occ = 0;
+	int cpu, m_a_cpu = -1;
+	cpumask_var_t cpus;
+
+	WARN_ON_ONCE(work != &p->cache_work);
+
+	work->next = work;
+
+	if (p->flags & PF_EXITING)
+		return;
+
+	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
+		return;
+
+	scoped_guard (cpus_read_lock) {
+		cpumask_copy(cpus, cpu_online_mask);
+
+		for_each_cpu(cpu, cpus) {
+			/* XXX sched_cluster_active */
+			struct sched_domain *sd = per_cpu(sd_llc, cpu);
+			unsigned long occ, m_occ = 0, a_occ = 0;
+			int m_cpu = -1, nr = 0, i;
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				occ = fraction_mm_sched(cpu_rq(i),
+							per_cpu_ptr(mm->pcpu_sched, i));
+				a_occ += occ;
+				if (occ > m_occ) {
+					m_occ = occ;
+					m_cpu = i;
+				}
+				nr++;
+				trace_printk("(%d) occ: %ld m_occ: %ld m_cpu: %d nr: %d\n",
+					     per_cpu(sd_llc_id, i), occ, m_occ, m_cpu, nr);
+			}
+
+			a_occ /= nr;
+			if (a_occ > m_a_occ) {
+				m_a_occ = a_occ;
+				m_a_cpu = m_cpu;
+			}
+
+			trace_printk("(%d) a_occ: %ld m_a_occ: %ld\n",
+				     per_cpu(sd_llc_id, cpu), a_occ, m_a_occ);
+
+			for_each_cpu(i, sched_domain_span(sd)) {
+				/* XXX threshold ? */
+				per_cpu_ptr(mm->pcpu_sched, i)->occ = a_occ;
+			}
+
+			cpumask_andnot(cpus, cpus, sched_domain_span(sd));
+		}
+	}
+
+	/*
+	 * If the max average cache occupancy is 'small' we don't care.
+	 */
+	if (m_a_occ < (NICE_0_LOAD >> EPOCH_OLD))
+		m_a_cpu = -1;
+
+	mm->mm_sched_cpu = m_a_cpu;
+
+	free_cpumask_var(cpus);
+}
+
+void init_sched_mm(struct task_struct *p)
+{
+	struct callback_head *work = &p->cache_work;
+	init_task_work(work, task_cache_work);
+	work->next = work;
+}
+
+#else
+
+static inline void account_mm_sched(struct rq *rq, struct task_struct *p,
+				    s64 delta_exec) { }
+
+
+void init_sched_mm(struct task_struct *p) { }
+
+static void task_tick_cache(struct rq *rq, struct task_struct *p) { }
+
+#endif
+
+static inline
+void update_curr_task(struct rq *rq, struct task_struct *p, s64 delta_exec)
 {
 	trace_sched_stat_runtime(p, delta_exec);
 	account_group_exec_runtime(p, delta_exec);
+	account_mm_sched(rq, p, delta_exec);
 	cgroup_account_cputime(p, delta_exec);
 }
 
@@ -1215,7 +1434,7 @@ s64 update_curr_common(struct rq *rq)
 
 	delta_exec = update_curr_se(rq, &donor->se);
 	if (likely(delta_exec > 0))
-		update_curr_task(donor, delta_exec);
+		update_curr_task(rq, donor, delta_exec);
 
 	return delta_exec;
 }
@@ -1244,7 +1463,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	if (entity_is_task(curr)) {
 		struct task_struct *p = task_of(curr);
 
-		update_curr_task(p, delta_exec);
+		update_curr_task(rq, p, delta_exec);
 
 		/*
 		 * If the fair_server is active, we need to account for the
@@ -7843,7 +8062,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	 * per-cpu select_rq_mask usage
 	 */
 	lockdep_assert_irqs_disabled();
-
+again:
 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, target))
 		return target;
@@ -7881,7 +8100,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/* Check a recently used CPU as a potential idle candidate: */
 	recent_used_cpu = p->recent_used_cpu;
 	p->recent_used_cpu = prev;
-	if (recent_used_cpu != prev &&
+	if (prev == p->wake_cpu &&
+	    recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
 	    cpus_share_cache(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
@@ -7934,6 +8154,18 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	if (prev != p->wake_cpu && !cpus_share_cache(prev, p->wake_cpu)) {
+		/*
+		 * Most likely select_cache_cpu() will have re-directed
+		 * the wakeup, but getting here means the preferred cache is
+		 * too busy, so re-try with the actual previous.
+		 *
+		 * XXX wake_affine is lost for this pass.
+		 */
+		prev = target = p->wake_cpu;
+		goto again;
+	}
+
 	/*
 	 * For cluster machines which have lower sharing cache like L2 or
 	 * LLC Tag, we tend to find an idle CPU in the target's cluster
@@ -8556,6 +8788,40 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return target;
 }
 
+#ifdef CONFIG_SCHED_CACHE
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
+
+static int select_cache_cpu(struct task_struct *p, int prev_cpu)
+{
+	struct mm_struct *mm = p->mm;
+	int cpu;
+
+	if (!mm || p->nr_cpus_allowed == 1)
+		return prev_cpu;
+
+	cpu = mm->mm_sched_cpu;
+	if (cpu < 0)
+		return prev_cpu;
+
+
+	if (static_branch_likely(&sched_numa_balancing) &&
+	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
+		/*
+		 * XXX look for max occupancy inside prev_cpu's node
+		 */
+		return prev_cpu;
+	}
+
+	return cpu;
+}
+#else
+static int select_cache_cpu(struct task_struct *p, int prev_cpu)
+{
+	return prev_cpu;
+}
+#endif
+
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8581,6 +8847,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	 * required for stable ->cpus_allowed
 	 */
 	lockdep_assert_held(&p->pi_lock);
+	guard(rcu)();
+
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
@@ -8588,6 +8856,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		    cpumask_test_cpu(cpu, p->cpus_ptr))
 			return cpu;
 
+		new_cpu = prev_cpu = select_cache_cpu(p, prev_cpu);
+
 		if (!is_rd_overutilized(this_rq()->rd)) {
 			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 			if (new_cpu >= 0)
@@ -8598,7 +8868,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		/*
 		 * If both 'cpu' and 'prev_cpu' are part of this domain,
@@ -8631,7 +8900,6 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		/* Fast path */
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 	}
-	rcu_read_unlock();
 
 	return new_cpu;
 }
@@ -9281,6 +9549,17 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	if (sysctl_sched_migration_cost == 0)
 		return 0;
 
+#ifdef CONFIG_SCHED_CACHE
+	if (p->mm && p->mm->pcpu_sched) {
+		/*
+		 * XXX things like Skylake have non-inclusive L3 and might not
+		 * like this L3 centric view. What to do about L2 stickyness ?
+		 */
+		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
+		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
+	}
+#endif
+
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
 
 	return delta < (s64)sysctl_sched_migration_cost;
@@ -9292,27 +9571,25 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
  * Returns 0, if task migration is not affected by locality.
  * Returns a negative value, if task migration improves locality i.e migration preferred.
  */
-static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
 	unsigned long src_weight, dst_weight;
 	int src_nid, dst_nid, dist;
 
-	if (!static_branch_likely(&sched_numa_balancing))
-		return 0;
-
-	if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+	if (!p->numa_faults)
 		return 0;
 
-	src_nid = cpu_to_node(env->src_cpu);
-	dst_nid = cpu_to_node(env->dst_cpu);
+	src_nid = cpu_to_node(src_cpu);
+	dst_nid = cpu_to_node(dst_cpu);
 
 	if (src_nid == dst_nid)
 		return 0;
 
 	/* Migrating away from the preferred node is always bad. */
 	if (src_nid == p->numa_preferred_nid) {
-		if (env->src_rq->nr_running > env->src_rq->nr_preferred_running)
+		struct rq *src_rq = cpu_rq(src_cpu);
+		if (src_rq->nr_running > src_rq->nr_preferred_running)
 			return 1;
 		else
 			return 0;
@@ -9323,7 +9600,7 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 		return -1;
 
 	/* Leaving a core idle is often worse than degrading locality. */
-	if (env->idle == CPU_IDLE)
+	if (idle)
 		return 0;
 
 	dist = node_distance(src_nid, dst_nid);
@@ -9338,7 +9615,24 @@ static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 	return src_weight - dst_weight;
 }
 
+static long migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+	if (!static_branch_likely(&sched_numa_balancing))
+		return 0;
+
+	if (!(env->sd->flags & SD_NUMA))
+		return 0;
+
+	return __migrate_degrades_locality(p, env->src_cpu, env->dst_cpu,
+					   env->idle == CPU_IDLE);
+}
+
 #else
+static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle)
+{
+	return 0;
+}
+
 static inline long migrate_degrades_locality(struct task_struct *p,
 					     struct lb_env *env)
 {
@@ -13098,8 +13392,8 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
  */
 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct cfs_rq *cfs_rq;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -13109,6 +13403,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	task_tick_cache(rq, curr);
+
 	update_misfit_status(curr, rq);
 	check_update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5a6a503eb6d..1b6d7e374bc3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1173,6 +1173,12 @@ struct rq {
 	u64			clock_pelt_idle_copy;
 	u64			clock_idle_copy;
 #endif
+#ifdef CONFIG_SCHED_CACHE
+	raw_spinlock_t		cpu_epoch_lock;
+	u64			cpu_runtime;
+	unsigned long		cpu_epoch;
+	unsigned long		cpu_epoch_next;
+#endif
 
 	atomic_t		nr_iowait;
 
@@ -3887,6 +3893,8 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
 static inline void init_sched_mm_cid(struct task_struct *t) { }
 #endif /* !CONFIG_SCHED_MM_CID */
 
+extern void init_sched_mm(struct task_struct *p);
+
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 #ifdef CONFIG_SMP
-- 
2.25.1



* [RFC PATCH 2/5] sched: Several fixes for cache aware scheduling
  2025-04-21  3:23 [RFC PATCH 0/5] sched: Introduce Cache aware scheduling Chen Yu
  2025-04-21  3:24 ` [RFC PATCH 1/5] sched: Cache aware load-balancing Chen Yu
@ 2025-04-21  3:24 ` Chen Yu
  2025-04-21  3:25 ` [RFC PATCH 3/5] sched: Avoid task migration within its preferred LLC Chen Yu
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Chen Yu @ 2025-04-21  3:24 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton,
	linux-kernel, Chen Yu

1. Fix the compile errors on per-CPU allocation.
2. Enqueue tasks to the target CPU instead of the current CPU;
   otherwise, the per-CPU occupancy will be messed up.
3. Fix the NULL LLC sched domain issue (Libo Chen).
4. Avoid a duplicated epoch check in task_tick_cache().
5. Introduce the sched feature SCHED_CACHE to control cache-aware
   scheduling.

TBD suggestions from the previous version: move cache_work from
per-task to per-mm_struct, and consider the actual CPU capacity in
fraction_mm_sched() (Abel Wu).

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/mm_types.h |  4 ++--
 kernel/sched/fair.c      | 15 +++++++++------
 kernel/sched/features.h  |  1 +
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 013291c6aaa2..9de4a0a13c4d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1411,11 +1411,11 @@ static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumas
 #endif /* CONFIG_SCHED_MM_CID */
 
 #ifdef CONFIG_SCHED_CACHE
-extern void mm_init_sched(struct mm_struct *mm, struct mm_sched *pcpu_sched);
+extern void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *pcpu_sched);
 
 static inline int mm_alloc_sched_noprof(struct mm_struct *mm)
 {
-	struct mm_sched *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
+	struct mm_sched __percpu *pcpu_sched = alloc_percpu_noprof(struct mm_sched);
 	if (!pcpu_sched)
 		return -ENOMEM;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23ea35dbd381..22b5830e7e4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1175,7 +1175,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 #define EPOCH_PERIOD	(HZ/100)	/* 10 ms */
 #define EPOCH_OLD	5		/* 50 ms */
 
-void mm_init_sched(struct mm_struct *mm, struct mm_sched *_pcpu_sched)
+void mm_init_sched(struct mm_struct *mm, struct mm_sched __percpu *_pcpu_sched)
 {
 	unsigned long epoch;
 	int i;
@@ -1254,7 +1254,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	if (!mm || !mm->pcpu_sched)
 		return;
 
-	pcpu_sched = this_cpu_ptr(p->mm->pcpu_sched);
+	pcpu_sched = per_cpu_ptr(p->mm->pcpu_sched, cpu_of(rq));
 
 	scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
 		__update_mm_sched(rq, pcpu_sched);
@@ -1286,9 +1286,6 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
 
 	guard(raw_spinlock)(&mm->mm_sched_lock);
 
-	if (mm->mm_sched_epoch == rq->cpu_epoch)
-		return;
-
 	if (work->next == work) {
 		task_work_add(p, work, TWA_RESUME);
 		WRITE_ONCE(mm->mm_sched_epoch, rq->cpu_epoch);
@@ -1322,6 +1319,9 @@ static void task_cache_work(struct callback_head *work)
 			unsigned long occ, m_occ = 0, a_occ = 0;
 			int m_cpu = -1, nr = 0, i;
 
+			if (!sd)
+				continue;
+
 			for_each_cpu(i, sched_domain_span(sd)) {
 				occ = fraction_mm_sched(cpu_rq(i),
 							per_cpu_ptr(mm->pcpu_sched, i));
@@ -8796,6 +8796,9 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	struct mm_struct *mm = p->mm;
 	int cpu;
 
+	if (!sched_feat(SCHED_CACHE))
+		return prev_cpu;
+
 	if (!mm || p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
@@ -9550,7 +9553,7 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 #ifdef CONFIG_SCHED_CACHE
-	if (p->mm && p->mm->pcpu_sched) {
+	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
 		/*
 		 * XXX things like Skylake have non-inclusive L3 and might not
 		 * like this L3 centric view. What to do about L2 stickyness ?
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..d2af7bfd36bf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -87,6 +87,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_UTIL, true)
 
+SCHED_FEAT(SCHED_CACHE, true)
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
  * in a single rq->lock section. Default disabled because the
-- 
2.25.1



* [RFC PATCH 3/5] sched: Avoid task migration within its preferred LLC
  2025-04-21  3:23 [RFC PATCH 0/5] sched: Introduce Cache aware scheduling Chen Yu
  2025-04-21  3:24 ` [RFC PATCH 1/5] sched: Cache aware load-balancing Chen Yu
  2025-04-21  3:24 ` [RFC PATCH 2/5] sched: Several fixes for cache aware scheduling Chen Yu
@ 2025-04-21  3:25 ` Chen Yu
  2025-04-21  3:25 ` [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated Chen Yu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Chen Yu @ 2025-04-21  3:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton,
	linux-kernel, Chen Yu

It was found that when running schbench, there is a
significant amount of in-LLC task migration, even if
the wakee is woken up on its preferred LLC. This
leads to core-to-core latency and impairs performance.

Inhibit task migration if the wakee is already in its
preferred LLC. Meanwhile, prevent the load balancer
from treating the task as cache-hot if the task is
being migrated out of its preferred LLC, rather than
comparing the occupancy between CPUs.

With this enhancement applied, the in-LLC task migration
has been reduced significantly (use PATCH 5/5 to verify).

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22b5830e7e4e..1733eb83042c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8806,6 +8806,12 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpu < 0)
 		return prev_cpu;
 
+	/*
+	 * No need to migrate the task if previous and preferred CPU
+	 * are in the same LLC.
+	 */
+	if (cpus_share_cache(prev_cpu, cpu))
+		return prev_cpu;
 
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
@@ -9553,14 +9559,13 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 #ifdef CONFIG_SCHED_CACHE
-	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->pcpu_sched) {
-		/*
-		 * XXX things like Skylake have non-inclusive L3 and might not
-		 * like this L3 centric view. What to do about L2 stickyness ?
-		 */
-		return per_cpu_ptr(p->mm->pcpu_sched, env->src_cpu)->occ >
-		       per_cpu_ptr(p->mm->pcpu_sched, env->dst_cpu)->occ;
-	}
+	/*
+	 * Don't migrate task out of its preferred LLC.
+	 */
+	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->mm_sched_cpu >= 0 &&
+	    cpus_share_cache(env->src_cpu, p->mm->mm_sched_cpu) &&
+	    !cpus_share_cache(env->src_cpu, env->dst_cpu))
+		return 1;
 #endif
 
 	delta = rq_clock_task(env->src_rq) - p->se.exec_start;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
  2025-04-21  3:23 [RFC PATCH 0/5] sched: Introduce Cache aware scheduling Chen Yu
                   ` (2 preceding siblings ...)
  2025-04-21  3:25 ` [RFC PATCH 3/5] sched: Avoid task migration within its preferred LLC Chen Yu
@ 2025-04-21  3:25 ` Chen Yu
  2025-04-24  9:22   ` Madadi Vineeth Reddy
  2025-04-21  3:25 ` [RFC PATCH 5/5] sched: Add ftrace to track task migration and load balance within and across LLC Chen Yu
  2025-04-29  3:47 ` [RFC PATCH 0/5] sched: Introduce Cache aware scheduling K Prateek Nayak
  5 siblings, 1 reply; 14+ messages in thread
From: Chen Yu @ 2025-04-21  3:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton,
	linux-kernel, Chen Yu

It was found that when a process's preferred LLC gets saturated by too many
threads, task contention becomes very frequent and causes performance regression.

Save the per-LLC statistics calculated by the periodic load balance. The
statistics include the average utilization and the average number of runnable
tasks. The task wakeup path for cache-aware scheduling consults these statistics
and inhibits cache-aware scheduling to avoid performance regression: when either
the average utilization of the preferred LLC has reached 25%, or the average
number of runnable tasks has exceeded 1/3 of the LLC weight, the cache-aware
wakeup is disabled. This restriction is applied only when the process has more
threads than the LLC weight.

schbench was run via mmtests on a Xeon platform with 2 sockets, each having
60 cores/120 CPUs. DRAM interleaving across NUMA nodes is enabled via the
BIOS, so there are 2 "LLCs" in 1 NUMA node.

compare-mmtests.pl --directory work/log --benchmark schbench --names baseline,sched_cache
                                    baselin             sched_cach
                                   baseline            sched_cache
Lat 50.0th-qrtle-1          6.00 (   0.00%)        6.00 (   0.00%)
Lat 90.0th-qrtle-1         10.00 (   0.00%)        9.00 (  10.00%)
Lat 99.0th-qrtle-1         29.00 (   0.00%)       13.00 (  55.17%)
Lat 99.9th-qrtle-1         35.00 (   0.00%)       21.00 (  40.00%)
Lat 20.0th-qrtle-1        266.00 (   0.00%)      266.00 (   0.00%)
Lat 50.0th-qrtle-2          8.00 (   0.00%)        6.00 (  25.00%)
Lat 90.0th-qrtle-2         10.00 (   0.00%)       10.00 (   0.00%)
Lat 99.0th-qrtle-2         19.00 (   0.00%)       18.00 (   5.26%)
Lat 99.9th-qrtle-2         27.00 (   0.00%)       29.00 (  -7.41%)
Lat 20.0th-qrtle-2        533.00 (   0.00%)      507.00 (   4.88%)
Lat 50.0th-qrtle-4          6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-4          8.00 (   0.00%)        5.00 (  37.50%)
Lat 99.0th-qrtle-4         14.00 (   0.00%)        9.00 (  35.71%)
Lat 99.9th-qrtle-4         22.00 (   0.00%)       14.00 (  36.36%)
Lat 20.0th-qrtle-4       1070.00 (   0.00%)      995.00 (   7.01%)
Lat 50.0th-qrtle-8          5.00 (   0.00%)        5.00 (   0.00%)
Lat 90.0th-qrtle-8          7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-8         12.00 (   0.00%)       11.00 (   8.33%)
Lat 99.9th-qrtle-8         19.00 (   0.00%)       16.00 (  15.79%)
Lat 20.0th-qrtle-8       2140.00 (   0.00%)     2140.00 (   0.00%)
Lat 50.0th-qrtle-16         6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-16         7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-16        12.00 (   0.00%)       10.00 (  16.67%)
Lat 99.9th-qrtle-16        17.00 (   0.00%)       14.00 (  17.65%)
Lat 20.0th-qrtle-16      4296.00 (   0.00%)     4200.00 (   2.23%)
Lat 50.0th-qrtle-32         6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-32         8.00 (   0.00%)        6.00 (  25.00%)
Lat 99.0th-qrtle-32        12.00 (   0.00%)       10.00 (  16.67%)
Lat 99.9th-qrtle-32        17.00 (   0.00%)       14.00 (  17.65%)
Lat 20.0th-qrtle-32      8496.00 (   0.00%)     8528.00 (  -0.38%)
Lat 50.0th-qrtle-64         6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-64         8.00 (   0.00%)        8.00 (   0.00%)
Lat 99.0th-qrtle-64        12.00 (   0.00%)       12.00 (   0.00%)
Lat 99.9th-qrtle-64        17.00 (   0.00%)       17.00 (   0.00%)
Lat 20.0th-qrtle-64     17120.00 (   0.00%)    17120.00 (   0.00%)
Lat 50.0th-qrtle-128        7.00 (   0.00%)        7.00 (   0.00%)
Lat 90.0th-qrtle-128        9.00 (   0.00%)        9.00 (   0.00%)
Lat 99.0th-qrtle-128       13.00 (   0.00%)       14.00 (  -7.69%)
Lat 99.9th-qrtle-128       20.00 (   0.00%)       20.00 (   0.00%)
Lat 20.0th-qrtle-128    31776.00 (   0.00%)    30496.00 (   4.03%)
Lat 50.0th-qrtle-239        9.00 (   0.00%)        9.00 (   0.00%)
Lat 90.0th-qrtle-239       14.00 (   0.00%)       18.00 ( -28.57%)
Lat 99.0th-qrtle-239       43.00 (   0.00%)       56.00 ( -30.23%)
Lat 99.9th-qrtle-239      106.00 (   0.00%)      483.00 (-355.66%)
Lat 20.0th-qrtle-239    30176.00 (   0.00%)    29984.00 (   0.64%)

We can see overall latency improvement and some throughput degradation
when the system gets saturated.

We also ran schbench (an older version) on an EPYC 7543 system, which has
4 NUMA nodes, each with 4 LLCs, and monitored the 99.0th percentile latency:

case                    load            baseline(std%)  compare%( std%)
normal                  4-mthreads-1-workers     1.00 (  6.47)   +9.02 (  4.68)
normal                  4-mthreads-2-workers     1.00 (  3.25)  +28.03 (  8.76)
normal                  4-mthreads-4-workers     1.00 (  6.67)   -4.32 (  2.58)
normal                  4-mthreads-8-workers     1.00 (  2.38)   +1.27 (  2.41)
normal                  4-mthreads-16-workers    1.00 (  5.61)   -8.48 (  4.39)
normal                  4-mthreads-31-workers    1.00 (  9.31)   -0.22 (  9.77)

When the LLC is underloaded, a latency improvement is observed; when the
LLC gets saturated, some degradation is seen.

Task aggregation moves tasks toward the preferred LLC fairly quickly during
wakeups. However, load balancing tends to move tasks away from the
aggregated LLC. These two migrations work in opposite directions and tend
to bounce tasks between LLCs. Such task migrations should be impeded in
load balancing as long as the home LLC is not overloaded. We're working on
fixing up the load balancing path to address such issues.

Co-developed-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched/topology.h |   4 ++
 kernel/sched/fair.c            | 101 ++++++++++++++++++++++++++++++++-
 2 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 198bb5cc1774..9625d9d762f5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -78,6 +78,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned long	util_avg;
+	u64		nr_avg;
+#endif
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1733eb83042c..f74d8773c811 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8791,6 +8791,58 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 #ifdef CONFIG_SCHED_CACHE
 static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
 
+/* expected to be protected by rcu_read_lock() */
+static bool get_llc_stats(int cpu, int *nr, int *weight, unsigned long *util)
+{
+	struct sched_domain_shared *sd_share;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (!sd_share)
+		return false;
+
+	*nr = READ_ONCE(sd_share->nr_avg);
+	*util = READ_ONCE(sd_share->util_avg);
+	*weight = per_cpu(sd_llc_size, cpu);
+
+	return true;
+}
+
+static bool valid_target_cpu(int cpu, struct task_struct *p)
+{
+	int nr_running, llc_weight;
+	unsigned long util, llc_cap;
+
+	if (!get_llc_stats(cpu, &nr_running, &llc_weight,
+			   &util))
+		return false;
+
+	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
+
+	/*
+	 * If this process has many threads, avoid task stacking
+	 * on the preferred LLC by checking the LLC's average
+	 * utilization and number of runnable tasks. Otherwise,
+	 * if the process does not have many threads, honor the
+	 * cache aware wakeup unconditionally.
+	 */
+	if (get_nr_threads(p) < llc_weight)
+		return true;
+
+	/*
+	 * Check if the average utilization exceeded 25% of the LLC
+	 * capacity, or if the average number of runnable tasks
+	 * exceeded 33% of the LLC weight. These magic numbers did
+	 * not cause heavy cache contention on Xeon or Zen.
+	 */
+	if (util * 4 >= llc_cap)
+		return false;
+
+	if (nr_running * 3 >= llc_weight)
+		return false;
+
+	return true;
+}
+
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
@@ -8813,6 +8865,9 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpus_share_cache(prev_cpu, cpu))
 		return prev_cpu;
 
+	if (!valid_target_cpu(cpu, p))
+		return prev_cpu;
+
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
 		/*
@@ -9564,7 +9619,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	 */
 	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->mm_sched_cpu >= 0 &&
 	    cpus_share_cache(env->src_cpu, p->mm->mm_sched_cpu) &&
-	    !cpus_share_cache(env->src_cpu, env->dst_cpu))
+	    !cpus_share_cache(env->src_cpu, env->dst_cpu) &&
+	     !valid_target_cpu(env->dst_cpu, p))
 		return 1;
 #endif
 
@@ -10634,6 +10690,48 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Save this sched group's statistics for later use:
+ * the task wakeup and load balance paths can make
+ * better decisions based on them.
+ */
+static void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs,
+			     struct sched_group *group)
+{
+	/* Find the sched domain that spans this group. */
+	struct sched_domain *sd = env->sd->child;
+	struct sched_domain_shared *sd_share;
+	u64 last_nr;
+
+	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/* Only consider the sched domain that spans exactly 1 LLC. */
+	if (!sd || !(sd->flags & SD_SHARE_LLC) ||
+	    !sd->parent || (sd->parent->flags & SD_SHARE_LLC))
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
+				   cpumask_first(sched_group_span(group))));
+	if (!sd_share)
+		return;
+
+	last_nr = READ_ONCE(sd_share->nr_avg);
+	update_avg(&last_nr, sgs->sum_nr_running);
+
+	if (likely(READ_ONCE(sd_share->util_avg) != sgs->group_util))
+		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
+
+	WRITE_ONCE(sd_share->nr_avg, last_nr);
+}
+#else
+static inline void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs,
+				    struct sched_group *group)
+{
+}
+#endif
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10723,6 +10821,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
+	update_sg_if_llc(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 5/5] sched: Add ftrace to track task migration and load balance within and across LLC
  2025-04-21  3:23 [RFC PATCH 0/5] sched: Introduce Cache aware scheduling Chen Yu
                   ` (3 preceding siblings ...)
  2025-04-21  3:25 ` [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated Chen Yu
@ 2025-04-21  3:25 ` Chen Yu
  2025-04-29  3:47 ` [RFC PATCH 0/5] sched: Introduce Cache aware scheduling K Prateek Nayak
  5 siblings, 0 replies; 14+ messages in thread
From: Chen Yu @ 2025-04-21  3:25 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton,
	linux-kernel, Chen Yu

[Not for upstream]
Introduce these ftrace events for debugging purposes.
Task migration activity is an important indicator for
inferring performance regressions.

Use the following bpftrace script to capture the task migrations:

tracepoint:sched:sched_attach_task
{
  $src_cpu = args->src_cpu;
  $dst_cpu = args->dst_cpu;
  $src_llc = args->src_llc;
  $dst_llc = args->dst_llc;
  $idle = args->idle;

  if ($src_llc == $dst_llc) {
    @lb_mig_1llc[$idle] = count();
  } else {
    @lb_mig_2llc[$idle] = count();
  }
}

tracepoint:sched:sched_select_task_rq
{
  $new_cpu = args->new_cpu;
  $old_cpu = args->old_cpu;
  $new_llc = args->new_llc;
  $old_llc = args->old_llc;

  if ($new_cpu != $old_cpu) {
    if ($new_llc == $old_llc) {
      @wake_mig_1llc[$new_llc] = count();
    } else {
      @wake_mig_2llc = count();
    }
  }
}

interval:s:10
{
        time("\n%H:%M:%S scheduler statistics: \n");
        print(@lb_mig_1llc);
        clear(@lb_mig_1llc);
        print(@lb_mig_2llc);
        clear(@lb_mig_2llc);
        print(@wake_mig_1llc);
        clear(@wake_mig_1llc);
        print(@wake_mig_2llc);
        clear(@wake_mig_2llc);
}

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/trace/events/sched.h | 51 ++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 24 ++++++++++++-----
 2 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 3bec9fb73a36..9995e09525ed 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -10,6 +10,57 @@
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
 
+TRACE_EVENT(sched_attach_task,
+
+	TP_PROTO(int src_cpu, int dst_cpu, int src_llc, int dst_llc, int idle),
+
+	TP_ARGS(src_cpu, dst_cpu, src_llc, dst_llc, idle),
+
+	TP_STRUCT__entry(
+		__field(	int,	src_cpu		)
+		__field(	int,	dst_cpu		)
+		__field(	int,	src_llc		)
+		__field(	int,	dst_llc		)
+		__field(	int,	idle		)
+	),
+
+	TP_fast_assign(
+		__entry->src_cpu	= src_cpu;
+		__entry->dst_cpu	= dst_cpu;
+		__entry->src_llc	= src_llc;
+		__entry->dst_llc	= dst_llc;
+		__entry->idle		= idle;
+	),
+
+	TP_printk("src_cpu=%d dst_cpu=%d src_llc=%d dst_llc=%d idle=%d",
+		  __entry->src_cpu, __entry->dst_cpu, __entry->src_llc,
+		  __entry->dst_llc, __entry->idle)
+);
+
+TRACE_EVENT(sched_select_task_rq,
+
+	TP_PROTO(int new_cpu, int old_cpu, int new_llc, int old_llc),
+
+	TP_ARGS(new_cpu, old_cpu, new_llc, old_llc),
+
+	TP_STRUCT__entry(
+		__field(	int,	new_cpu		)
+		__field(	int,	old_cpu		)
+		__field(	int,	new_llc		)
+		__field(	int,	old_llc		)
+	),
+
+	TP_fast_assign(
+		__entry->new_cpu	= new_cpu;
+		__entry->old_cpu	= old_cpu;
+		__entry->new_llc	= new_llc;
+		__entry->old_llc	= old_llc;
+	),
+
+	TP_printk("new_cpu=%d old_cpu=%d new_llc=%d old_llc=%d",
+		  __entry->new_cpu, __entry->old_cpu, __entry->new_llc, __entry->old_llc)
+);
+
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f74d8773c811..635fd3a6009c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8902,7 +8902,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 	struct sched_domain *tmp, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = prev_cpu;
+	int new_cpu = prev_cpu, orig_prev_cpu = prev_cpu;
 	int want_affine = 0;
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
@@ -8965,6 +8965,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 		new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 	}
 
+	trace_sched_select_task_rq(new_cpu, orig_prev_cpu,
+				   per_cpu(sd_llc_id, new_cpu),
+				   per_cpu(sd_llc_id, orig_prev_cpu));
+
 	return new_cpu;
 }
 
@@ -10026,11 +10030,17 @@ static int detach_tasks(struct lb_env *env)
 /*
  * attach_task() -- attach the task detached by detach_task() to its new rq.
  */
-static void attach_task(struct rq *rq, struct task_struct *p)
+static void attach_task(struct rq *rq, struct task_struct *p, struct lb_env *env)
 {
 	lockdep_assert_rq_held(rq);
 
 	WARN_ON_ONCE(task_rq(p) != rq);
+
+	if (env)
+		trace_sched_attach_task(env->src_cpu, env->dst_cpu,
+					per_cpu(sd_llc_id, env->src_cpu),
+					per_cpu(sd_llc_id, env->dst_cpu),
+					env->idle);
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	wakeup_preempt(rq, p, 0);
 }
@@ -10039,13 +10049,13 @@ static void attach_task(struct rq *rq, struct task_struct *p)
  * attach_one_task() -- attaches the task returned from detach_one_task() to
  * its new rq.
  */
-static void attach_one_task(struct rq *rq, struct task_struct *p)
+static void attach_one_task(struct rq *rq, struct task_struct *p, struct lb_env *env)
 {
 	struct rq_flags rf;
 
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
-	attach_task(rq, p);
+	attach_task(rq, p, env);
 	rq_unlock(rq, &rf);
 }
 
@@ -10066,7 +10076,7 @@ static void attach_tasks(struct lb_env *env)
 		p = list_first_entry(tasks, struct task_struct, se.group_node);
 		list_del_init(&p->se.group_node);
 
-		attach_task(env->dst_rq, p);
+		attach_task(env->dst_rq, p, env);
 	}
 
 	rq_unlock(env->dst_rq, &rf);
@@ -12457,6 +12467,7 @@ static int active_load_balance_cpu_stop(void *data)
 	struct sched_domain *sd;
 	struct task_struct *p = NULL;
 	struct rq_flags rf;
+	struct lb_env env_tmp;
 
 	rq_lock_irq(busiest_rq, &rf);
 	/*
@@ -12512,6 +12523,7 @@ static int active_load_balance_cpu_stop(void *data)
 		} else {
 			schedstat_inc(sd->alb_failed);
 		}
+		memcpy(&env_tmp, &env, sizeof(env));
 	}
 	rcu_read_unlock();
 out_unlock:
@@ -12519,7 +12531,7 @@ static int active_load_balance_cpu_stop(void *data)
 	rq_unlock(busiest_rq, &rf);
 
 	if (p)
-		attach_one_task(target_rq, p);
+		attach_one_task(target_rq, p, sd ? &env_tmp : NULL);
 
 	local_irq_enable();
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
  2025-04-21  3:25 ` [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated Chen Yu
@ 2025-04-24  9:22   ` Madadi Vineeth Reddy
  2025-04-24 14:11     ` Chen, Yu C
  0 siblings, 1 reply; 14+ messages in thread
From: Madadi Vineeth Reddy @ 2025-04-24  9:22 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, linux-kernel,
	Madadi Vineeth Reddy

Hi Chen Yu,

On 21/04/25 08:55, Chen Yu wrote:
> It is found that when the process's preferred LLC gets saturated by too many
> threads, task contention is very frequent and causes performance regression.
> 
> Save the per LLC statistics calculated by periodic load balance. The statistics
> include the average utilization and the average number of runnable tasks.
> The task wakeup path for cache aware scheduling manipulates these statistics
> to inhibit cache aware scheduling to avoid performance regression. When either
> the average utilization of the preferred LLC has reached 25%, or the average
> number of runnable tasks has exceeded 1/3 of the LLC weight, the cache aware
> wakeup is disabled. Only when the process has more threads than the LLC weight
> will this restriction be enabled.
> 
> Running schbench via mmtests on a Xeon platform, which has 2 sockets, each socket
> has 60 Cores/120 CPUs. The DRAM interleave is enabled across NUMA nodes via BIOS,
> so there are 2 "LLCs" in 1 NUMA node.
> 
> [..snip..]
> 
> We can see overall latency improvement and some throughput degradation
> when the system gets saturated.
> 
> Also, we run schbench (old version) on an EPYC 7543 system, which has
> 4 NUMA nodes, and each node has 4 LLCs. Monitor the 99.0th latency:
> 
> case                    load            baseline(std%)  compare%( std%)
> normal                  4-mthreads-1-workers     1.00 (  6.47)   +9.02 (  4.68)
> normal                  4-mthreads-2-workers     1.00 (  3.25)  +28.03 (  8.76)
> normal                  4-mthreads-4-workers     1.00 (  6.67)   -4.32 (  2.58)
> normal                  4-mthreads-8-workers     1.00 (  2.38)   +1.27 (  2.41)
> normal                  4-mthreads-16-workers    1.00 (  5.61)   -8.48 (  4.39)
> normal                  4-mthreads-31-workers    1.00 (  9.31)   -0.22 (  9.77)
> 
> When the LLC is underloaded, the latency improvement is observed. When the LLC
> gets saturated, we observe some degradation.
> 

[..snip..]

> +static bool valid_target_cpu(int cpu, struct task_struct *p)
> +{
> +	int nr_running, llc_weight;
> +	unsigned long util, llc_cap;
> +
> +	if (!get_llc_stats(cpu, &nr_running, &llc_weight,
> +			   &util))
> +		return false;
> +
> +	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
> +
> +	/*
> +	 * If this process has many threads, be careful to avoid
> +	 * task stacking on the preferred LLC, by checking the system's
> +	 * utilization and runnable tasks. Otherwise, if this
> +	 * process does not have many threads, honor the cache
> +	 * aware wakeup.
> +	 */
> +	if (get_nr_threads(p) < llc_weight)
> +		return true;

IIUC, there might be scenarios where the LLC is already overloaded with
threads of another process. In that case, we would return true for p in the
above condition without checking the conditions below. Shouldn't we check
those two conditions either way?

Tested this patch with the real-life workload DayTrader and didn't see any
regression. It spawns a lot of threads and is CPU intensive, so I think it's
not impacted by the conditions below.

Also, in the schbench numbers you provided, there is a degradation in the
saturated case. Is it due to the overhead of computing the preferred LLC,
which then goes unused because of the conditions below?

Thanks,
Madadi Vineeth Reddy

> +
> +	/*
> +	 * Check if it exceeded 25% of average utiliazation,
> +	 * or if it exceeded 33% of CPUs. This is a magic number
> +	 * that did not cause heavy cache contention on Xeon or
> +	 * Zen.
> +	 */
> +	if (util * 4 >= llc_cap)
> +		return false;
> +
> +	if (nr_running * 3 >= llc_weight)
> +		return false;
> +
> +	return true;
> +}
> +

[..snip..]


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
  2025-04-24  9:22   ` Madadi Vineeth Reddy
@ 2025-04-24 14:11     ` Chen, Yu C
  2025-04-24 15:51       ` Tim Chen
  2025-04-25  8:58       ` Madadi Vineeth Reddy
  0 siblings, 2 replies; 14+ messages in thread
From: Chen, Yu C @ 2025-04-24 14:11 UTC (permalink / raw)
  To: Madadi Vineeth Reddy
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, linux-kernel, Len Brown,
	Chen Yu

Hi Madadi,

On 4/24/2025 5:22 PM, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> 
> On 21/04/25 08:55, Chen Yu wrote:
>> It is found that when the process's preferred LLC gets saturated by too many
>> threads, task contention is very frequent and causes performance regression.
>>
>> Save the per LLC statistics calculated by periodic load balance. The statistics
>> include the average utilization and the average number of runnable tasks.
>> The task wakeup path for cache aware scheduling manipulates these statistics
>> to inhibit cache aware scheduling to avoid performance regression. When either
>> the average utilization of the preferred LLC has reached 25%, or the average
>> number of runnable tasks has exceeded 1/3 of the LLC weight, the cache aware
>> wakeup is disabled. Only when the process has more threads than the LLC weight
>> will this restriction be enabled.
>>
>> Running schbench via mmtests on a Xeon platform, which has 2 sockets, each socket
>> has 60 Cores/120 CPUs. The DRAM interleave is enabled across NUMA nodes via BIOS,
>> so there are 2 "LLCs" in 1 NUMA node.
>>
>> [..snip..]
>>
>> We can see overall latency improvement and some throughput degradation
>> when the system gets saturated.
>>
>> Also, we ran schbench (old version) on an EPYC 7543 system, which has
>> 4 NUMA nodes, each with 4 LLCs, monitoring the 99.0th percentile latency:
>>
>> case                    load            baseline(std%)  compare%( std%)
>> normal                  4-mthreads-1-workers     1.00 (  6.47)   +9.02 (  4.68)
>> normal                  4-mthreads-2-workers     1.00 (  3.25)  +28.03 (  8.76)
>> normal                  4-mthreads-4-workers     1.00 (  6.67)   -4.32 (  2.58)
>> normal                  4-mthreads-8-workers     1.00 (  2.38)   +1.27 (  2.41)
>> normal                  4-mthreads-16-workers    1.00 (  5.61)   -8.48 (  4.39)
>> normal                  4-mthreads-31-workers    1.00 (  9.31)   -0.22 (  9.77)
>>
>> When the LLC is underloaded, a latency improvement is observed. When the LLC
>> gets saturated, we observe some degradation.
>>
> 
> [..snip..]
> 
>> +static bool valid_target_cpu(int cpu, struct task_struct *p)
>> +{
>> +	int nr_running, llc_weight;
>> +	unsigned long util, llc_cap;
>> +
>> +	if (!get_llc_stats(cpu, &nr_running, &llc_weight,
>> +			   &util))
>> +		return false;
>> +
>> +	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
>> +
>> +	/*
>> +	 * If this process has many threads, be careful to avoid
>> +	 * task stacking on the preferred LLC, by checking the system's
>> +	 * utilization and runnable tasks. Otherwise, if this
>> +	 * process does not have many threads, honor the cache
>> +	 * aware wakeup.
>> +	 */
>> +	if (get_nr_threads(p) < llc_weight)
>> +		return true;
> 
> IIUC, there might be scenarios where the llc might already be overloaded with
> threads of another process. In that case, we will return true for p in the
> above condition and not check the below conditions. Shouldn't we check
> the below two conditions either way?

The reason why get_nr_threads() was used is that we don't know whether the
following thresholds are suitable for different workloads. We chose 25%
and 33% because we found that they worked well for workload A, but were
too low for workload B. Workload B requires cache-aware scheduling to be
enabled in any case, and the number of threads in B is smaller than the
llc_weight. Therefore, we use the above check to meet the requirements
of B. What you said is correct. We can remove the above check on
nr_threads and make the combination of utilization and nr_running a
mandatory check, and then conduct further tuning.
> Tested this patch with real life workload Daytrader, didn't see any regression.

Good to know the regression is gone.

> It spawns a lot of threads and is CPU intensive. So, I think it's not impacted
> due to the below conditions.
> 
> Also, in the schbench numbers provided by you, there is a degradation in the
> saturated case. Is it due to the overhead of computing the preferred llc,
> which is not being used due to the below conditions?

Yes, the overhead of the preferred LLC calculation could be one part, and
we also suspect that the degradation might be tied to the task migrations.
We still observed more task migrations than the baseline, even when the
system was saturated (in theory, after 25% is exceeded, we should fall
back to the generic task wakeup path). We haven't dug into that yet, and
we can investigate it in the following days.

thanks,
Chenyu
> Thanks,
> Madadi Vineeth Reddy
> 
>> +
>> +	/*
>> +	 * Check if it exceeded 25% of the average utilization,
>> +	 * or if it exceeded 33% of the CPUs. These are magic
>> +	 * numbers that did not cause heavy cache contention on
>> +	 * Xeon or Zen.
>> +	 */
>> +	if (util * 4 >= llc_cap)
>> +		return false;
>> +
>> +	if (nr_running * 3 >= llc_weight)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
> 
> [..snip..]
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
  2025-04-24 14:11     ` Chen, Yu C
@ 2025-04-24 15:51       ` Tim Chen
  2025-04-25  9:13         ` Madadi Vineeth Reddy
  2025-04-25  8:58       ` Madadi Vineeth Reddy
  1 sibling, 1 reply; 14+ messages in thread
From: Tim Chen @ 2025-04-24 15:51 UTC (permalink / raw)
  To: Chen, Yu C, Madadi Vineeth Reddy
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, linux-kernel, Len Brown,
	Chen Yu

On Thu, 2025-04-24 at 22:11 +0800, Chen, Yu C wrote:
> 
> > It spawns a lot of threads and is CPU intensive. So, I think it's not impacted
> > due to the below conditions.
> > 
> > Also, in schbench numbers provided by you, there is a degradation in saturated
> > case. Is it due to the overhead in computing the preferred llc which is not
> > being used due to below conditions?
> 
> Yes, the overhead of preferred LLC calculation could be one part, and we 
> also suspect that the degradation might be tied to the task migrations. 
> We still observed more task migrations than the baseline, even when the 
> system was saturated (in theory, after 25% is exceeded, we should 
> fallback to the generic task wakeup path). We haven't dug into that yet, 
> and we can conduct an investigation in the following days.

In the saturation case it is mostly the tail latency that regresses.
The preferred LLC has a tendency to carry a higher load than the
other LLCs. The load balancer will try to move tasks out, and wake
balance will try to move them back to the preferred LLC. This increases
task migrations and affects tail latency.

Tim

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
  2025-04-24 14:11     ` Chen, Yu C
  2025-04-24 15:51       ` Tim Chen
@ 2025-04-25  8:58       ` Madadi Vineeth Reddy
  1 sibling, 0 replies; 14+ messages in thread
From: Madadi Vineeth Reddy @ 2025-04-25  8:58 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Gautham R . Shenoy,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Hillf Danton, linux-kernel, Len Brown,
	Chen Yu, Madadi Vineeth Reddy

On 24/04/25 19:41, Chen, Yu C wrote:
> Hi Madadi,
> 
> On 4/24/2025 5:22 PM, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 21/04/25 08:55, Chen Yu wrote:
>>> It is found that when the process's preferred LLC gets saturated by too many
>>> threads, task contention is very frequent and causes performance regression.
>>>
>>> Save the per LLC statistics calculated by periodic load balance. The statistics
>>> include the average utilization and the average number of runnable tasks.
>>> The task wakeup path for cache aware scheduling manipulates these statistics
>>> to inhibit cache aware scheduling to avoid performance regression. When either
>>> the average utilization of the preferred LLC has reached 25%, or the average
>>> number of runnable tasks has exceeded 1/3 of the LLC weight, the cache aware
>>> wakeup is disabled. Only when the process has more threads than the LLC weight
>>> will this restriction be enabled.
>>>
>>> Running schbench via mmtests on a Xeon platform, which has 2 sockets, each socket
>>> has 60 Cores/120 CPUs. The DRAM interleave is enabled across NUMA nodes via BIOS,
>>> so there are 2 "LLCs" in 1 NUMA node.
>>>
>>> compare-mmtests.pl --directory work/log --benchmark schbench --names baseline,sched_cache
>>>                                      baselin             sched_cach
>>>                                     baseline            sched_cache
>>> Lat 50.0th-qrtle-1          6.00 (   0.00%)        6.00 (   0.00%)
>>> Lat 90.0th-qrtle-1         10.00 (   0.00%)        9.00 (  10.00%)
>>> Lat 99.0th-qrtle-1         29.00 (   0.00%)       13.00 (  55.17%)
>>> Lat 99.9th-qrtle-1         35.00 (   0.00%)       21.00 (  40.00%)
>>> Lat 20.0th-qrtle-1        266.00 (   0.00%)      266.00 (   0.00%)
>>> Lat 50.0th-qrtle-2          8.00 (   0.00%)        6.00 (  25.00%)
>>> Lat 90.0th-qrtle-2         10.00 (   0.00%)       10.00 (   0.00%)
>>> Lat 99.0th-qrtle-2         19.00 (   0.00%)       18.00 (   5.26%)
>>> Lat 99.9th-qrtle-2         27.00 (   0.00%)       29.00 (  -7.41%)
>>> Lat 20.0th-qrtle-2        533.00 (   0.00%)      507.00 (   4.88%)
>>> Lat 50.0th-qrtle-4          6.00 (   0.00%)        5.00 (  16.67%)
>>> Lat 90.0th-qrtle-4          8.00 (   0.00%)        5.00 (  37.50%)
>>> Lat 99.0th-qrtle-4         14.00 (   0.00%)        9.00 (  35.71%)
>>> Lat 99.9th-qrtle-4         22.00 (   0.00%)       14.00 (  36.36%)
>>> Lat 20.0th-qrtle-4       1070.00 (   0.00%)      995.00 (   7.01%)
>>> Lat 50.0th-qrtle-8          5.00 (   0.00%)        5.00 (   0.00%)
>>> Lat 90.0th-qrtle-8          7.00 (   0.00%)        5.00 (  28.57%)
>>> Lat 99.0th-qrtle-8         12.00 (   0.00%)       11.00 (   8.33%)
>>> Lat 99.9th-qrtle-8         19.00 (   0.00%)       16.00 (  15.79%)
>>> Lat 20.0th-qrtle-8       2140.00 (   0.00%)     2140.00 (   0.00%)
>>> Lat 50.0th-qrtle-16         6.00 (   0.00%)        5.00 (  16.67%)
>>> Lat 90.0th-qrtle-16         7.00 (   0.00%)        5.00 (  28.57%)
>>> Lat 99.0th-qrtle-16        12.00 (   0.00%)       10.00 (  16.67%)
>>> Lat 99.9th-qrtle-16        17.00 (   0.00%)       14.00 (  17.65%)
>>> Lat 20.0th-qrtle-16      4296.00 (   0.00%)     4200.00 (   2.23%)
>>> Lat 50.0th-qrtle-32         6.00 (   0.00%)        5.00 (  16.67%)
>>> Lat 90.0th-qrtle-32         8.00 (   0.00%)        6.00 (  25.00%)
>>> Lat 99.0th-qrtle-32        12.00 (   0.00%)       10.00 (  16.67%)
>>> Lat 99.9th-qrtle-32        17.00 (   0.00%)       14.00 (  17.65%)
>>> Lat 20.0th-qrtle-32      8496.00 (   0.00%)     8528.00 (  -0.38%)
>>> Lat 50.0th-qrtle-64         6.00 (   0.00%)        5.00 (  16.67%)
>>> Lat 90.0th-qrtle-64         8.00 (   0.00%)        8.00 (   0.00%)
>>> Lat 99.0th-qrtle-64        12.00 (   0.00%)       12.00 (   0.00%)
>>> Lat 99.9th-qrtle-64        17.00 (   0.00%)       17.00 (   0.00%)
>>> Lat 20.0th-qrtle-64     17120.00 (   0.00%)    17120.00 (   0.00%)
>>> Lat 50.0th-qrtle-128        7.00 (   0.00%)        7.00 (   0.00%)
>>> Lat 90.0th-qrtle-128        9.00 (   0.00%)        9.00 (   0.00%)
>>> Lat 99.0th-qrtle-128       13.00 (   0.00%)       14.00 (  -7.69%)
>>> Lat 99.9th-qrtle-128       20.00 (   0.00%)       20.00 (   0.00%)
>>> Lat 20.0th-qrtle-128    31776.00 (   0.00%)    30496.00 (   4.03%)
>>> Lat 50.0th-qrtle-239        9.00 (   0.00%)        9.00 (   0.00%)
>>> Lat 90.0th-qrtle-239       14.00 (   0.00%)       18.00 ( -28.57%)
>>> Lat 99.0th-qrtle-239       43.00 (   0.00%)       56.00 ( -30.23%)
>>> Lat 99.9th-qrtle-239      106.00 (   0.00%)      483.00 (-355.66%)
>>> Lat 20.0th-qrtle-239    30176.00 (   0.00%)    29984.00 (   0.64%)
>>>
>>> We can see overall latency improvement and some throughput degradation
>>> when the system gets saturated.
>>>
>>> Also, we ran schbench (old version) on an EPYC 7543 system, which has
>>> 4 NUMA nodes, each with 4 LLCs, monitoring the 99.0th percentile latency:
>>>
>>> case                    load            baseline(std%)  compare%( std%)
>>> normal                  4-mthreads-1-workers     1.00 (  6.47)   +9.02 (  4.68)
>>> normal                  4-mthreads-2-workers     1.00 (  3.25)  +28.03 (  8.76)
>>> normal                  4-mthreads-4-workers     1.00 (  6.67)   -4.32 (  2.58)
>>> normal                  4-mthreads-8-workers     1.00 (  2.38)   +1.27 (  2.41)
>>> normal                  4-mthreads-16-workers    1.00 (  5.61)   -8.48 (  4.39)
>>> normal                  4-mthreads-31-workers    1.00 (  9.31)   -0.22 (  9.77)
>>>
>>> When the LLC is underloaded, a latency improvement is observed. When the LLC
>>> gets saturated, we observe some degradation.
>>>
>>
>> [..snip..]
>>
>>> +static bool valid_target_cpu(int cpu, struct task_struct *p)
>>> +{
>>> +    int nr_running, llc_weight;
>>> +    unsigned long util, llc_cap;
>>> +
>>> +    if (!get_llc_stats(cpu, &nr_running, &llc_weight,
>>> +               &util))
>>> +        return false;
>>> +
>>> +    llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
>>> +
>>> +    /*
>>> +     * If this process has many threads, be careful to avoid
>>> +     * task stacking on the preferred LLC, by checking the system's
>>> +     * utilization and runnable tasks. Otherwise, if this
>>> +     * process does not have many threads, honor the cache
>>> +     * aware wakeup.
>>> +     */
>>> +    if (get_nr_threads(p) < llc_weight)
>>> +        return true;
>>
>> IIUC, there might be scenarios where the llc might already be overloaded with
>> threads of another process. In that case, we will return true for p in the
>> above condition and not check the below conditions. Shouldn't we check
>> the below two conditions either way?
> 
> The reason why get_nr_threads() was used is that we don't know whether the following thresholds are suitable for different workloads. We chose 25% and 33% because we found that they worked well for workload A, but were too low for workload B. Workload B requires cache-aware scheduling to be enabled in any case, and the number of threads in B is smaller than the llc_weight. Therefore, we use the above check to meet the requirements of B. What you said is correct. We can remove the above check on nr_threads and make the combination of utilization and nr_running a mandatory check, and then conduct further tuning.

Thanks Chen. It's always tricky to make all workloads happy. As long as
we're not regressing too much on the others, it should be fine, I guess,
given that the overall impact is positive.

JFYI, in Power10 the LLC is at the small-core level, containing 4 threads. So,
nr_running on the LLC can't be more than 1 for cache aware scheduling to work.

>> Tested this patch with real life workload Daytrader, didn't see any regression.
> 
> Good to know the regression is gone.
> 
>> It spawns a lot of threads and is CPU intensive. So, I think it's not impacted
>> due to the below conditions.
>>
>> Also, in schbench numbers provided by you, there is a degradation in saturated
>> case. Is it due to the overhead in computing the preferred llc which is not
>> being used due to below conditions?
> 
> Yes, the overhead of preferred LLC calculation could be one part, and we also suspect that the degradation might be tied to the task migrations. We still observed more task migrations than the baseline, even when the system was saturated (in theory, after 25% is exceeded, we should fallback to the generic task wakeup path). We haven't dug into that yet, and we can conduct an investigation in the following days.

Interesting. I will also try to look into these extra migrations.

Thanks,
Madadi Vineeth Reddy

> 
> thanks,
> Chenyu
>> Thanks,
>> Madadi Vineeth Reddy
>>
>>> +
>>> +    /*
>>> +     * Check if it exceeded 25% of the average utilization,
>>> +     * or if it exceeded 33% of the CPUs. These are magic
>>> +     * numbers that did not cause heavy cache contention on
>>> +     * Xeon or Zen.
>>> +     */
>>> +    if (util * 4 >= llc_cap)
>>> +        return false;
>>> +
>>> +    if (nr_running * 3 >= llc_weight)
>>> +        return false;
>>> +
>>> +    return true;
>>> +}
>>> +
>>
>> [..snip..]
>>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
  2025-04-24 15:51       ` Tim Chen
@ 2025-04-25  9:13         ` Madadi Vineeth Reddy
  2025-04-25 17:29           ` Tim Chen
  0 siblings, 1 reply; 14+ messages in thread
From: Madadi Vineeth Reddy @ 2025-04-25  9:13 UTC (permalink / raw)
  To: Tim Chen
  Cc: Chen, Yu C, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Hillf Danton, linux-kernel,
	Len Brown, Chen Yu, Madadi Vineeth Reddy

Hi Tim,

On 24/04/25 21:21, Tim Chen wrote:
> On Thu, 2025-04-24 at 22:11 +0800, Chen, Yu C wrote:
>>
>>> It spawns a lot of threads and is CPU intensive. So, I think it's not impacted
>>> due to the below conditions.
>>>
>>> Also, in schbench numbers provided by you, there is a degradation in saturated
>>> case. Is it due to the overhead in computing the preferred llc which is not
>>> being used due to below conditions?
>>
>> Yes, the overhead of preferred LLC calculation could be one part, and we 
>> also suspect that the degradation might be tied to the task migrations. 
>> We still observed more task migrations than the baseline, even when the 
>> system was saturated (in theory, after 25% is exceeded, we should 
>> fallback to the generic task wakeup path). We haven't dug into that yet, 
>> and we can conduct an investigation in the following days.
> 
> In the saturation case it is mostly the tail latency that regresses.
> The preferred LLC has a tendency to carry a higher load than the
> other LLCs. The load balancer will try to move tasks out, and wake
> balance will try to move them back to the preferred LLC. This increases
> task migrations and affects tail latency.

Why would the task be moved back to the preferred LLC in wakeup path for the
saturated case? The checks shouldn't allow it right?

Thanks,
Madadi Vineeth Reddy

> 
> Tim


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
  2025-04-25  9:13         ` Madadi Vineeth Reddy
@ 2025-04-25 17:29           ` Tim Chen
  0 siblings, 0 replies; 14+ messages in thread
From: Tim Chen @ 2025-04-25 17:29 UTC (permalink / raw)
  To: e45141a1a64d7dcfca2683f56735ba4da60ba19e.camel
  Cc: Chen, Yu C, Peter Zijlstra, Ingo Molnar, K Prateek Nayak,
	Gautham R . Shenoy, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Tim Chen,
	Vincent Guittot, Libo Chen, Abel Wu, Hillf Danton, linux-kernel,
	Len Brown, Chen Yu, Madadi Vineeth Reddy

On Fri, 2025-04-25 at 14:43 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
> 
> On 24/04/25 21:21, Tim Chen wrote:
> > On Thu, 2025-04-24 at 22:11 +0800, Chen, Yu C wrote:
> > > 
> > > > It spawns a lot of threads and is CPU intensive. So, I think it's not impacted
> > > > due to the below conditions.
> > > > 
> > > > Also, in schbench numbers provided by you, there is a degradation in saturated
> > > > case. Is it due to the overhead in computing the preferred llc which is not
> > > > being used due to below conditions?
> > > 
> > > Yes, the overhead of preferred LLC calculation could be one part, and we 
> > > also suspect that the degradation might be tied to the task migrations. 
> > > We still observed more task migrations than the baseline, even when the 
> > > system was saturated (in theory, after 25% is exceeded, we should 
> > > fallback to the generic task wakeup path). We haven't dug into that yet, 
> > > and we can conduct an investigation in the following days.
> > 
> > In the saturation case it is mostly the tail latency that regresses.
> > The preferred LLC has a tendency to carry a higher load than the
> > other LLCs. The load balancer will try to move tasks out, and wake
> > balance will try to move them back to the preferred LLC. This increases
> > task migrations and affects tail latency.
> 
> Why would the task be moved back to the preferred LLC in wakeup path for the
> saturated case? The checks shouldn't allow it right?

Task wakeups happen very frequently in schbench, and it takes a while for the
utilization to catch up. The utilization of an LLC is updated only at that
LLC's load-balance time.

So once the recorded utilization falls below the threshold, there is a window
where the woken tasks will rush into the preferred LLC until the utilization
is updated at the next load balance.

Tim


> 
> Thanks,
> Madadi Vineeth Reddy
> 
> > 
> > Tim
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/5] sched: Introduce Cache aware scheduling
  2025-04-21  3:23 [RFC PATCH 0/5] sched: Introduce Cache aware scheduling Chen Yu
                   ` (4 preceding siblings ...)
  2025-04-21  3:25 ` [RFC PATCH 5/5] sched: Add ftrace to track task migration and load balance within and across LLC Chen Yu
@ 2025-04-29  3:47 ` K Prateek Nayak
  2025-04-29 12:57   ` Chen, Yu C
  5 siblings, 1 reply; 14+ messages in thread
From: K Prateek Nayak @ 2025-04-29  3:47 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Ingo Molnar, Gautham R . Shenoy
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton,
	linux-kernel

Hello Chenyu,

On 4/21/2025 8:53 AM, Chen Yu wrote:
> This is a respin of the cache-aware scheduling proposed by Peter[1].
> In this patch set, some known issues in [1] were addressed, and the performance
> regression was investigated and mitigated.
> 
> Cache-aware scheduling aims to aggregate tasks with potential shared resources
> into the same cache domain. This approach enhances cache locality, thereby optimizing
> system performance by reducing cache misses and improving data access efficiency.
> 
> In the current implementation, threads within the same process are considered as
> entities that potentially share resources. Cache-aware scheduling monitors the CPU
> occupancy of each cache domain for every process. Based on this monitoring, it endeavors
> to migrate threads within a given process to its cache-hot domains, with the goal of
> maximizing cache locality.
> 
> Patch 1 constitutes the fundamental cache-aware scheduling. It is the same patch as [1].
> Patch 2 comprises a series of fixes for Patch 1, including compiling warnings and functional
> fixes.
> Patch 3 fixes performance degradation that arise from excessive task migrations within the
> preferred LLC domain.
> Patch 4 further alleviates performance regressions when the preferred LLC becomes saturated.
> Patch 5 introduces ftrace events, which is used to track task migrations triggered by wakeup
> and load balancer. This addition facilitate performance regression analysis.
> 
> The patch set is applied on top of v6.14 sched/core,
> commit 4ba7518327c6 ("sched/debug: Print the local group's asym_prefer_cpu")
> 

Thank you for working on this! I have been a bit preoccupied but I
promise to look into the regressions I've reported below sometime
this week and report back soon on what seems to make them unhappy.

tl;dr

o Most regressions aren't as severe as v1 thanks to all the work
   from you and Abel.

o I too see schbench regress in fully loaded cases but the old
   schbench tail latencies improve when #threads < #CPUs in LLC

o There is a consistent regression in tbench - what I presume is
   happening there is that all threads of "tbench_srv" share an mm and
   all the tbench clients share an mm, but for best performance the
   wakeups between client and server must be local (same core / same
   LLC); either the cost of the additional search builds up, or the
   clients get co-located as one set of entities and the servers get
   co-located as another set, leading to mostly remote wakeups.

   Not too sure if netperf has a similar architecture to tbench,
   but that too sees a regression.

o Longer running benchmarks see a regression. Still not sure if
   this is because of the additional search or something else.

I'll leave the full results below:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Benchmark results

   ==================================================================
   Test          : hackbench
   Units         : Normalized time in seconds
   Interpretation: Lower is better
   Statistic     : AMean
   ==================================================================
   Case:           tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    1-groups     1.00 [ -0.00]( 9.02)     1.03 [ -3.38](11.44)
    2-groups     1.00 [ -0.00]( 6.86)     0.98 [  2.20]( 6.61)
    4-groups     1.00 [ -0.00]( 2.73)     1.00 [  0.42]( 4.00)
    8-groups     1.00 [ -0.00]( 1.21)     1.04 [ -4.00]( 5.59)
   16-groups     1.00 [ -0.00]( 0.97)     1.01 [ -0.52]( 2.12)


   ==================================================================
   Test          : tbench
   Units         : Normalized throughput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:    tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
       1     1.00 [  0.00]( 0.67)     0.96 [ -3.95]( 0.55)
       2     1.00 [  0.00]( 0.85)     0.98 [ -1.69]( 0.65)
       4     1.00 [  0.00]( 0.52)     0.96 [ -3.68]( 0.09)
       8     1.00 [  0.00]( 0.92)     0.96 [ -4.06]( 0.43)
      16     1.00 [  0.00]( 1.01)     0.95 [ -5.19]( 1.65)
      32     1.00 [  0.00]( 1.35)     0.95 [ -4.79]( 0.29)
      64     1.00 [  0.00]( 1.22)     0.94 [ -6.49]( 1.46)
     128     1.00 [  0.00]( 2.39)     0.92 [ -7.61]( 1.41)
     256     1.00 [  0.00]( 1.83)     0.92 [ -8.24]( 0.35)
     512     1.00 [  0.00]( 0.17)     0.93 [ -7.08]( 0.22)
    1024     1.00 [  0.00]( 0.31)     0.91 [ -8.57]( 0.29)


   ==================================================================
   Test          : stream-10
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    Copy     1.00 [  0.00]( 8.24)     1.03 [  2.66]( 6.15)
   Scale     1.00 [  0.00]( 5.62)     0.99 [ -1.43]( 6.32)
     Add     1.00 [  0.00]( 6.18)     0.97 [ -3.12]( 5.70)
   Triad     1.00 [  0.00]( 5.29)     1.01 [  1.31]( 3.82)


   ==================================================================
   Test          : stream-100
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    Copy     1.00 [  0.00]( 2.92)     0.99 [ -1.47]( 5.02)
   Scale     1.00 [  0.00]( 4.80)     0.98 [ -2.08]( 5.53)
     Add     1.00 [  0.00]( 4.35)     0.98 [ -1.85]( 4.26)
   Triad     1.00 [  0.00]( 2.30)     0.99 [ -0.84]( 1.83)


   ==================================================================
   Test          : netperf
   Units         : Normalized Througput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:         tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    1-clients     1.00 [  0.00]( 0.17)     0.97 [ -2.55]( 0.50)
    2-clients     1.00 [  0.00]( 0.77)     0.97 [ -2.52]( 0.20)
    4-clients     1.00 [  0.00]( 0.93)     0.97 [ -3.30]( 0.54)
    8-clients     1.00 [  0.00]( 0.87)     0.96 [ -3.98]( 1.19)
   16-clients     1.00 [  0.00]( 1.15)     0.96 [ -4.16]( 1.06)
   32-clients     1.00 [  0.00]( 1.00)     0.95 [ -5.47]( 0.96)
   64-clients     1.00 [  0.00]( 1.37)     0.94 [ -5.75]( 1.64)
   128-clients    1.00 [  0.00]( 0.99)     0.92 [ -8.50]( 1.49)
   256-clients    1.00 [  0.00]( 3.23)     0.90 [-10.22]( 2.86)
   512-clients    1.00 [  0.00](58.43)     0.90 [-10.28](47.59)


   ==================================================================
   Test          : schbench
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [ -0.00]( 5.59)     0.55 [ 45.00](11.17)
     2     1.00 [ -0.00](14.29)     0.52 [ 47.62]( 7.53)
     4     1.00 [ -0.00]( 1.24)     0.57 [ 42.55]( 5.73)
     8     1.00 [ -0.00](11.16)     1.06 [ -6.12]( 2.92)
    16     1.00 [ -0.00]( 6.81)     1.12 [-12.28](11.09)
    32     1.00 [ -0.00]( 6.99)     1.05 [ -5.26](12.48)
    64     1.00 [ -0.00]( 6.00)     0.96 [  4.21](18.31)
   128     1.00 [ -0.00]( 3.26)     1.63 [-62.84](36.71)
   256     1.00 [ -0.00](19.29)     0.97 [  3.25]( 4.94)
   512     1.00 [ -0.00]( 1.48)     1.05 [ -4.71]( 5.11)


   ==================================================================
   Test          : new-schbench-requests-per-second
   Units         : Normalized Requests per second
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [  0.00]( 0.00)     0.95 [ -4.99]( 0.48)
     2     1.00 [  0.00]( 0.26)     0.96 [ -3.82]( 0.55)
     4     1.00 [  0.00]( 0.15)     0.95 [ -4.96]( 0.27)
     8     1.00 [  0.00]( 0.15)     0.99 [ -0.58]( 0.00)
    16     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
    32     1.00 [  0.00]( 4.88)     1.04 [  4.27]( 2.42)
    64     1.00 [  0.00]( 5.57)     0.87 [-13.10](11.51)
   128     1.00 [  0.00]( 0.34)     0.97 [ -3.13]( 0.58)
   256     1.00 [  0.00]( 1.95)     1.02 [  1.83]( 0.15)
   512     1.00 [  0.00]( 0.44)     1.00 [  0.48]( 0.12)


   ==================================================================
   Test          : new-schbench-wakeup-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [ -0.00]( 4.19)     1.00 [ -0.00](14.91)
     2     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 0.00)
     4     1.00 [ -0.00]( 8.91)     0.80 [ 20.00]( 4.43)
     8     1.00 [ -0.00]( 7.45)     1.00 [ -0.00]( 7.45)
    16     1.00 [ -0.00]( 4.08)     1.00 [ -0.00](10.79)
    32     1.00 [ -0.00](16.90)     0.93 [  6.67](10.00)
    64     1.00 [ -0.00]( 9.11)     1.12 [-12.50]( 0.00)
   128     1.00 [ -0.00]( 7.05)     2.43 [-142.86](24.47)
   256     1.00 [ -0.00]( 4.32)     1.02 [ -2.34]( 1.20)
   512     1.00 [ -0.00]( 0.35)     1.01 [ -0.77]( 0.40)


   ==================================================================
   Test          : new-schbench-request-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [ -0.00]( 0.78)     1.16 [-15.70]( 2.14)
     2     1.00 [ -0.00]( 0.81)     1.13 [-13.11]( 0.62)
     4     1.00 [ -0.00]( 0.24)     1.26 [-26.11](16.43)
     8     1.00 [ -0.00]( 1.30)     1.03 [ -3.46]( 0.81)
    16     1.00 [ -0.00]( 1.11)     1.02 [ -2.12]( 1.85)
    32     1.00 [ -0.00]( 5.94)     0.96 [  4.05]( 4.48)
    64     1.00 [ -0.00]( 6.27)     1.06 [ -6.01]( 6.67)
   128     1.00 [ -0.00]( 0.21)     1.12 [-12.31]( 2.61)
   256     1.00 [ -0.00](13.73)     1.06 [ -6.30]( 3.37)
   512     1.00 [ -0.00]( 0.95)     1.05 [ -4.85]( 0.61)


   ==================================================================
   Test          : Various longer running benchmarks
   Units         : %diff in throughput reported
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   Benchmarks:                 %diff
   ycsb-cassandra              -1.21%
   ycsb-mongodb                -0.69%

   deathstarbench-1x           -7.40%
   deathstarbench-2x           -3.80%
   deathstarbench-3x           -3.99%
   deathstarbench-6x           -3.02%

   hammerdb+mysql 16VU         -2.59%
   hammerdb+mysql 64VU         -1.05%


Also, could you fold the below diff into your Patch2:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb5a2572b4f8..6c51dd2b7b32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7694,8 +7694,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
  	struct sched_domain_shared *sd_share;
  
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
  	if (sched_feat(SIS_UTIL)) {
  		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
  		if (sd_share) {
@@ -7707,6 +7705,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
  		}
  	}
  
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
  	if (static_branch_unlikely(&sched_cluster_active)) {
  		struct sched_group *sg = sd->groups;
  
---

If the SIS_UTIL cutoff hits, the result of the cpumask_and() is of no
use. To save some additional cycles, especially in cases where we target
the LLC frequently and the search bails out because the LLC is busy,
this overhead can easily be avoided. Since select_idle_cpu() can now be
called twice per wakeup, the overhead can be visible in benchmarks like
hackbench.

-- 
Thanks and Regards,
Prateek

> schbench was tested on EMR and Zen3 Milan. An improvement in tail latency was observed when
> the LLC was underloaded; however, some regressions were still evident when the LLC was
> saturated. Additionally, the load balance should be adjusted to further address these
> regressions.
> 
> [1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> 
> 
> Chen Yu (4):
>    sched: Several fixes for cache aware scheduling
>    sched: Avoid task migration within its preferred LLC
>    sched: Inhibit cache aware scheduling if the preferred LLC is over
>      aggregated
>    sched: Add ftrace to track task migration and load balance within and
>      across LLC
> 
> Peter Zijlstra (1):
>    sched: Cache aware load-balancing
> 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 0/5] sched: Introduce Cache aware scheduling
  2025-04-29  3:47 ` [RFC PATCH 0/5] sched: Introduce Cache aware scheduling K Prateek Nayak
@ 2025-04-29 12:57   ` Chen, Yu C
  0 siblings, 0 replies; 14+ messages in thread
From: Chen, Yu C @ 2025-04-29 12:57 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Tim Chen, Vincent Guittot,
	Libo Chen, Abel Wu, Madadi Vineeth Reddy, Hillf Danton,
	linux-kernel, Peter Zijlstra, Gautham R . Shenoy, Ingo Molnar,
	Len Brown

Hi Prateek,

On 4/29/2025 11:47 AM, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 4/21/2025 8:53 AM, Chen Yu wrote:
>> This is a respin of the cache-aware scheduling proposed by Peter[1].
>> In this patch set, some known issues in [1] were addressed, and the 
>> performance
>> regression was investigated and mitigated.
>>
>> Cache-aware scheduling aims to aggregate tasks with potential shared 
>> resources
>> into the same cache domain. This approach enhances cache locality, 
>> thereby optimizing
>> system performance by reducing cache misses and improving data access 
>> efficiency.
>>
>> In the current implementation, threads within the same process are 
>> considered as
>> entities that potentially share resources. Cache-aware scheduling 
>> monitors the CPU
>> occupancy of each cache domain for every process. Based on this 
>> monitoring, it endeavors
>> to migrate threads within a given process to its cache-hot domains, 
>> with the goal of
>> maximizing cache locality.
>>
>> Patch 1 constitutes the fundamental cache-aware scheduling. It is the 
>> same patch as [1].
>> Patch 2 comprises a series of fixes for Patch 1, including compiling 
>> warnings and functional
>> fixes.
>> Patch 3 fixes performance degradation that arise from excessive task 
>> migrations within the
>> preferred LLC domain.
>> Patch 4 further alleviates performance regressions when the preferred 
>> LLC becomes saturated.
>> Patch 5 introduces ftrace events, which are used to track task 
>> migrations triggered by the wakeup path
>> and the load balancer. This addition facilitates performance regression 
>> analysis.
>>
>> The patch set is applied on top of v6.14 sched/core,
>> commit 4ba7518327c6 ("sched/debug: Print the local group's 
>> asym_prefer_cpu")
>>
> 
> Thank you for working on this! I have been a bit preoccupied but I
> promise to look into the regressions I've reported below sometime
> this week and report back soon on what seems to make them unhappy.
> 

Thanks for your time on these tests.

> tl;dr
> 
> o Most regressions aren't as severe as v1 thanks to all the work
>    from you and Abel.
> 
> o I too see schbench regress in fully loaded cases but the old
>    schbench tail latencies improve when #threads < #CPUs in LLC
> 
> o There is a consistent regression in tbench - what I presume is
>    happening there is all threads of "tbench_srv" share an mm and
>    and all the tbench clients share an mm but for best performance,
>    the wakeups between client and server must be local (same core /
>    same LLC) but either the cost of additional search build up or
>    the clients get co-located as one set of entities and the
>    servers get colocated as another set of entities leading to
>    mostly remote wakeups.

This is a good point. If A and B are both multi-threaded processes,
and A interacts with B frequently, we should consider not only
aggregating the threads within A and within B, but also placing A and
B together. I'm not sure whether WF_SYNC is carried along and takes
effect during the tbench socket wakeup path. I'll also try
tbench/netperf testing.

> 
>    Not too sure if netperf has similar architecture as tbench but
>    that too sees a regression.
> 
> o Longer running benchmarks see a regression. Still not sure if
>    this is because of additional search or something else.
> 
> I'll leave the full results below:
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)

> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> o Benchmark results
> 
>    ==================================================================
>    Test          : hackbench
>    Units         : Normalized time in seconds
>    Interpretation: Lower is better
>    Statistic     : AMean
>    ==================================================================
>    Case:           tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>     1-groups     1.00 [ -0.00]( 9.02)     1.03 [ -3.38](11.44)
>     2-groups     1.00 [ -0.00]( 6.86)     0.98 [  2.20]( 6.61)
>     4-groups     1.00 [ -0.00]( 2.73)     1.00 [  0.42]( 4.00)
>     8-groups     1.00 [ -0.00]( 1.21)     1.04 [ -4.00]( 5.59)
>    16-groups     1.00 [ -0.00]( 0.97)     1.01 [ -0.52]( 2.12)
> 
> 
>    ==================================================================
>    Test          : tbench
>    Units         : Normalized throughput
>    Interpretation: Higher is better
>    Statistic     : AMean
>    ==================================================================
>    Clients:    tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>        1     1.00 [  0.00]( 0.67)     0.96 [ -3.95]( 0.55)
>        2     1.00 [  0.00]( 0.85)     0.98 [ -1.69]( 0.65)
>        4     1.00 [  0.00]( 0.52)     0.96 [ -3.68]( 0.09)
>        8     1.00 [  0.00]( 0.92)     0.96 [ -4.06]( 0.43)
>       16     1.00 [  0.00]( 1.01)     0.95 [ -5.19]( 1.65)
>       32     1.00 [  0.00]( 1.35)     0.95 [ -4.79]( 0.29)
>       64     1.00 [  0.00]( 1.22)     0.94 [ -6.49]( 1.46)
>      128     1.00 [  0.00]( 2.39)     0.92 [ -7.61]( 1.41)
>      256     1.00 [  0.00]( 1.83)     0.92 [ -8.24]( 0.35)
>      512     1.00 [  0.00]( 0.17)     0.93 [ -7.08]( 0.22)
>     1024     1.00 [  0.00]( 0.31)     0.91 [ -8.57]( 0.29)
> 
> 
>    ==================================================================
>    Test          : stream-10
>    Units         : Normalized Bandwidth, MB/s
>    Interpretation: Higher is better
>    Statistic     : HMean
>    ==================================================================
>    Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>     Copy     1.00 [  0.00]( 8.24)     1.03 [  2.66]( 6.15)
>    Scale     1.00 [  0.00]( 5.62)     0.99 [ -1.43]( 6.32)
>      Add     1.00 [  0.00]( 6.18)     0.97 [ -3.12]( 5.70)
>    Triad     1.00 [  0.00]( 5.29)     1.01 [  1.31]( 3.82)
> 
> 
>    ==================================================================
>    Test          : stream-100
>    Units         : Normalized Bandwidth, MB/s
>    Interpretation: Higher is better
>    Statistic     : HMean
>    ==================================================================
>    Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>     Copy     1.00 [  0.00]( 2.92)     0.99 [ -1.47]( 5.02)
>    Scale     1.00 [  0.00]( 4.80)     0.98 [ -2.08]( 5.53)
>      Add     1.00 [  0.00]( 4.35)     0.98 [ -1.85]( 4.26)
>    Triad     1.00 [  0.00]( 2.30)     0.99 [ -0.84]( 1.83)
> 
> 
>    ==================================================================
>    Test          : netperf
>    Units         : Normalized Throughput
>    Interpretation: Higher is better
>    Statistic     : AMean
>    ==================================================================
>    Clients:         tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>     1-clients     1.00 [  0.00]( 0.17)     0.97 [ -2.55]( 0.50)
>     2-clients     1.00 [  0.00]( 0.77)     0.97 [ -2.52]( 0.20)
>     4-clients     1.00 [  0.00]( 0.93)     0.97 [ -3.30]( 0.54)
>     8-clients     1.00 [  0.00]( 0.87)     0.96 [ -3.98]( 1.19)
>    16-clients     1.00 [  0.00]( 1.15)     0.96 [ -4.16]( 1.06)
>    32-clients     1.00 [  0.00]( 1.00)     0.95 [ -5.47]( 0.96)
>    64-clients     1.00 [  0.00]( 1.37)     0.94 [ -5.75]( 1.64)
>    128-clients    1.00 [  0.00]( 0.99)     0.92 [ -8.50]( 1.49)
>    256-clients    1.00 [  0.00]( 3.23)     0.90 [-10.22]( 2.86)
>    512-clients    1.00 [  0.00](58.43)     0.90 [-10.28](47.59)
> 
> 
>    ==================================================================
>    Test          : schbench
>    Units         : Normalized 99th percentile latency in us
>    Interpretation: Lower is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>      1     1.00 [ -0.00]( 5.59)     0.55 [ 45.00](11.17)
>      2     1.00 [ -0.00](14.29)     0.52 [ 47.62]( 7.53)
>      4     1.00 [ -0.00]( 1.24)     0.57 [ 42.55]( 5.73)
>      8     1.00 [ -0.00](11.16)     1.06 [ -6.12]( 2.92)
>     16     1.00 [ -0.00]( 6.81)     1.12 [-12.28](11.09)
>     32     1.00 [ -0.00]( 6.99)     1.05 [ -5.26](12.48)
>     64     1.00 [ -0.00]( 6.00)     0.96 [  4.21](18.31)
>    128     1.00 [ -0.00]( 3.26)     1.63 [-62.84](36.71)
>    256     1.00 [ -0.00](19.29)     0.97 [  3.25]( 4.94)
>    512     1.00 [ -0.00]( 1.48)     1.05 [ -4.71]( 5.11)
> 
> 
>    ==================================================================
>    Test          : new-schbench-requests-per-second
>    Units         : Normalized Requests per second
>    Interpretation: Higher is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>      1     1.00 [  0.00]( 0.00)     0.95 [ -4.99]( 0.48)
>      2     1.00 [  0.00]( 0.26)     0.96 [ -3.82]( 0.55)
>      4     1.00 [  0.00]( 0.15)     0.95 [ -4.96]( 0.27)
>      8     1.00 [  0.00]( 0.15)     0.99 [ -0.58]( 0.00)
>     16     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
>     32     1.00 [  0.00]( 4.88)     1.04 [  4.27]( 2.42)
>     64     1.00 [  0.00]( 5.57)     0.87 [-13.10](11.51)
>    128     1.00 [  0.00]( 0.34)     0.97 [ -3.13]( 0.58)
>    256     1.00 [  0.00]( 1.95)     1.02 [  1.83]( 0.15)
>    512     1.00 [  0.00]( 0.44)     1.00 [  0.48]( 0.12)
> 
> 
>    ==================================================================
>    Test          : new-schbench-wakeup-latency
>    Units         : Normalized 99th percentile latency in us
>    Interpretation: Lower is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>      1     1.00 [ -0.00]( 4.19)     1.00 [ -0.00](14.91)
>      2     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 0.00)
>      4     1.00 [ -0.00]( 8.91)     0.80 [ 20.00]( 4.43)
>      8     1.00 [ -0.00]( 7.45)     1.00 [ -0.00]( 7.45)
>     16     1.00 [ -0.00]( 4.08)     1.00 [ -0.00](10.79)
>     32     1.00 [ -0.00](16.90)     0.93 [  6.67](10.00)
>     64     1.00 [ -0.00]( 9.11)     1.12 [-12.50]( 0.00)
>    128     1.00 [ -0.00]( 7.05)     2.43 [-142.86](24.47)

OK, this was what I saw too. I'm looking into this.

>    256     1.00 [ -0.00]( 4.32)     1.02 [ -2.34]( 1.20)
>    512     1.00 [ -0.00]( 0.35)     1.01 [ -0.77]( 0.40)
> 
> 
>    ==================================================================
>    Test          : new-schbench-request-latency
>    Units         : Normalized 99th percentile latency in us
>    Interpretation: Lower is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>      1     1.00 [ -0.00]( 0.78)     1.16 [-15.70]( 2.14)
>      2     1.00 [ -0.00]( 0.81)     1.13 [-13.11]( 0.62)
>      4     1.00 [ -0.00]( 0.24)     1.26 [-26.11](16.43)
>      8     1.00 [ -0.00]( 1.30)     1.03 [ -3.46]( 0.81)
>     16     1.00 [ -0.00]( 1.11)     1.02 [ -2.12]( 1.85)
>     32     1.00 [ -0.00]( 5.94)     0.96 [  4.05]( 4.48)
>     64     1.00 [ -0.00]( 6.27)     1.06 [ -6.01]( 6.67)
>    128     1.00 [ -0.00]( 0.21)     1.12 [-12.31]( 2.61)
>    256     1.00 [ -0.00](13.73)     1.06 [ -6.30]( 3.37)
>    512     1.00 [ -0.00]( 0.95)     1.05 [ -4.85]( 0.61)
> 
> 
>    ==================================================================
>    Test          : Various longer running benchmarks
>    Units         : %diff in throughput reported
>    Interpretation: Higher is better
>    Statistic     : Median
>    ==================================================================
>    Benchmarks:                 %diff
>    ycsb-cassandra              -1.21%
>    ycsb-mongodb                -0.69%
> 
>    deathstarbench-1x           -7.40%
>    deathstarbench-2x           -3.80%
>    deathstarbench-3x           -3.99%
>    deathstarbench-6x           -3.02%
> 
>    hammerdb+mysql 16VU         -2.59%
>    hammerdb+mysql 64VU         -1.05%
> 

For long-running tasks, the penalty of remote cache access is severe.
This might indicate an issue similar to the tbench/netperf one you
mentioned: different processes are aggregated onto different LLCs, but
these processes interact with each other and WF_SYNC did not take effect.

> 
> Also, could you fold the below diff into your Patch2:
> 

Sure, let me apply it and do the test.

thanks,
Chenyu

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eb5a2572b4f8..6c51dd2b7b32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7694,8 +7694,6 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, bool
>       int i, cpu, idle_cpu = -1, nr = INT_MAX;
>       struct sched_domain_shared *sd_share;
> 
> -    cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> -
>       if (sched_feat(SIS_UTIL)) {
>           sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
>           if (sd_share) {
> @@ -7707,6 +7705,8 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, bool
>           }
>       }
> 
> +    cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
>       if (static_branch_unlikely(&sched_cluster_active)) {
>           struct sched_group *sg = sd->groups;
> 
> ---
> 
> If the SIS_UTIL cut off hits, that result of the cpumask_and() is of no
> use. To save some additional cycles, especially in cases where we target
> the LLC frequently and the search bails out because the LLC is busy,
> this overhead can be easily avoided. Since select_idle_cpu() can now be
> called twice per wakeup, this overhead can be visible in benchmarks like
> hackbench.
> 


end of thread, other threads:[~2025-04-29 12:58 UTC | newest]

Thread overview: 14+ messages
2025-04-21  3:23 [RFC PATCH 0/5] sched: Introduce Cache aware scheduling Chen Yu
2025-04-21  3:24 ` [RFC PATCH 1/5] sched: Cache aware load-balancing Chen Yu
2025-04-21  3:24 ` [RFC PATCH 2/5] sched: Several fixes for cache aware scheduling Chen Yu
2025-04-21  3:25 ` [RFC PATCH 3/5] sched: Avoid task migration within its preferred LLC Chen Yu
2025-04-21  3:25 ` [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated Chen Yu
2025-04-24  9:22   ` Madadi Vineeth Reddy
2025-04-24 14:11     ` Chen, Yu C
2025-04-24 15:51       ` Tim Chen
2025-04-25  9:13         ` Madadi Vineeth Reddy
2025-04-25 17:29           ` Tim Chen
2025-04-25  8:58       ` Madadi Vineeth Reddy
2025-04-21  3:25 ` [RFC PATCH 5/5] sched: Add ftrace to track task migration and load balance within and across LLC Chen Yu
2025-04-29  3:47 ` [RFC PATCH 0/5] sched: Introduce Cache aware scheduling K Prateek Nayak
2025-04-29 12:57   ` Chen, Yu C
