public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/5]
@ 2025-04-09 11:15 K Prateek Nayak
  2025-04-09 11:15 ` [RFC PATCH 1/5] sched/fair: Add push task framework K Prateek Nayak
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-09 11:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal,
	K Prateek Nayak

Hello everyone,

There was some interest at OSPM'25 to explore using the push task
mechanism for idle and newidle balance. This series implements one such
idea. The main reason for the RFC is to understand if this is the
implementation people were in favor of before trying to optimize it for
all the workloads from my test setup.

Note: The current performance of the prototype is rough. I haven't
optimized it yet since I would love some feedback first on the approach.


Current approach
================

The push task framework for the fair class has been cherry-picked from
Vincent's series [1] and has been implemented for the !EAS case.

This series implements the idea from Valentin [2] where, in the presence
of pushable tasks, the CPU will set itself on a per-LLC "overloaded_mask".

The inter-NUMA newidle balance has been modified to traverse the CPUs
set on the overloaded mask, first in the local LLC, and then the CPUs
set on the overloaded masks of other LLCs in the same NUMA node, with
the goal of pulling a single task towards itself rather than performing
a full-fledged load balance.

This implements some of the ideas from David Vernet's SHARED_RUNQ
prototype [3] except that, instead of a single SHARED_RUNQ per LLC /
per shard, the overloaded mask serves as an indicator of the per-CPU
rq(s) containing pushable tasks that can be migrated to the CPU going
idle. This avoids a per-SHARED_RUNQ lock at the expense of maintaining
the overloaded cpumask.
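To make the shape of this concrete, here is a minimal userspace C model
of the overloaded-mask indicator. All names, the 64-CPU limit, and the
scan helper below are assumptions of the sketch, not code from the
series; the real implementation operates on
sched_domain_shared::overloaded_mask with the kernel cpumask API.

```c
#include <stdint.h>

/*
 * Model of the per-LLC "overloaded_mask" idea: each CPU sets its bit
 * when its runqueue holds pushable tasks and clears it once the
 * pushable list drains. A CPU going newly idle then scans only the set
 * bits instead of peeking at every runqueue in the LLC.
 */
typedef uint64_t llc_mask_t;	/* one bit per CPU; assumes <= 64 CPUs per LLC */

static inline void mark_overloaded(llc_mask_t *mask, int cpu, int has_pushable)
{
	if (has_pushable)
		*mask |= 1ULL << cpu;
	else
		*mask &= ~(1ULL << cpu);
}

/*
 * Find a candidate source CPU, wrapping around from "start" the way
 * for_each_cpu_wrap() does in the kernel. Returns -1 if no CPU in the
 * LLC currently advertises pushable tasks.
 */
static inline int find_overloaded_cpu(llc_mask_t mask, int start, int nr_cpus)
{
	for (int i = 0; i < nr_cpus; i++) {
		int cpu = (start + i) % nr_cpus;

		if (mask & (1ULL << cpu))
			return cpu;
	}
	return -1;
}
```

Under this model, a newly idle CPU 4 in an 8-CPU LLC where CPUs 3 and 5
advertise pushable tasks would pull from CPU 5 first; once CPU 5 drains
and clears its bit, the scan falls back to CPU 3 without touching the
six empty runqueues.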

The push callback itself has been modified to try to push the tasks on
the pushable task list to one of the CPUs on the "nohz.idle_cpus_mask",
taking the load off of idle balancing.


Clarification required
======================

I believe using the per-CPU pushable task list as a proxy for a single
SHARED_RUNQ was the idea Peter was implying during the discussion. Is
this correct or did I completely misunderstand it? P.S. SHARED_RUNQ
could also be modelled as a large per-LLC push list.

An alternate implementation is to allow CPUs to go idle as quickly as
possible and then rely completely on the push mechanism and the
"nohz.idle_cpus_mask" to push tasks to an idle CPU; however, this puts
the burden of moving tasks on a busy, overloaded CPU, which may not be
ideal.

Since folks mentioned using "push mechanism" for newidle balance, was
the above idea the one they had in mind?

There still seems to be some clear advantage to doing a complete
balance in the newidle path. Since schedstats are not rigged up yet for
the new approach, I'm not completely sure where the advantages and
disadvantages currently lie.

If the current approach is the right one, I'll dig deeper to try to
address all the shortcomings of this prototype.

Systems with a unified LLC will likely run into bottlenecks maintaining
a large per-LLC mask that can see multiple concurrent updates. I have
plans to implement an "sd_shard" which shards the large LLC, making the
cpumask maintenance less heavy on these systems.
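As a rough illustration of what that sharding could look like, here is
a userspace C sketch. The shard width, types, and helper names are all
assumptions made for the model; "sd_shard" itself is only proposed
above, not implemented in this series.

```c
#include <stdint.h>

/*
 * Sketch of a sharded overloaded mask: instead of one wide per-LLC
 * cpumask that every CPU updates (bouncing the same cache lines), the
 * LLC is split into fixed-width shards so concurrent set/clear
 * operations from distant CPUs mostly touch different words.
 */
#define SHARD_WIDTH	16		/* CPUs per shard; illustrative */
#define NR_SHARDS	8		/* supports up to 128 CPUs per LLC */

struct sharded_mask {
	uint16_t shard[NR_SHARDS];
};

static inline void shard_mark(struct sharded_mask *m, int cpu, int overloaded)
{
	uint16_t bit = 1u << (cpu % SHARD_WIDTH);

	if (overloaded)
		m->shard[cpu / SHARD_WIDTH] |= bit;
	else
		m->shard[cpu / SHARD_WIDTH] &= ~bit;
}

/*
 * A newly idle CPU scans its home shard first and then the remaining
 * shards, so nearby overloaded CPUs are preferred. Returns -1 if no
 * CPU in the LLC has pushable tasks.
 */
static inline int shard_find(const struct sharded_mask *m, int this_cpu)
{
	int home = this_cpu / SHARD_WIDTH;

	for (int s = 0; s < NR_SHARDS; s++) {
		int idx = (home + s) % NR_SHARDS;
		uint16_t word = m->shard[idx];

		if (word)
			return idx * SHARD_WIDTH + __builtin_ctz(word);
	}
	return -1;
}
```

The scan order also preserves the locality bias of the single-mask
scheme: a CPU prefers pulling from its own shard before looking at the
rest of the LLC.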


References
==========

[1] https://lore.kernel.org/lkml/20250302210539.1563190-6-vincent.guittot@linaro.org/
[2] https://lore.kernel.org/lkml/xhsmh1putoxbz.mognet@vschneid-thinkpadt14sgen2i.remote.csb/
[3] https://lore.kernel.org/lkml/20231212003141.216236-1-void@manifault.com/

--
K Prateek Nayak (4):
  sched/fair: Introduce overloaded_mask in sched_domain_shared
  sched/fair: Update overloaded mask in presence of pushable task
  sched/fair: Rework inter-NUMA newidle balancing
  sched/fair: Proactive idle balance using push mechanism

Vincent Guittot (1):
  sched/fair: Add push task framework

 include/linux/sched/topology.h |   1 +
 kernel/sched/fair.c            | 297 +++++++++++++++++++++++++++++++--
 kernel/sched/sched.h           |   2 +
 kernel/sched/topology.c        |  25 ++-
 4 files changed, 306 insertions(+), 19 deletions(-)


base-commit: 6432e163ba1b7d80b5876792ce53e511f041ab91
-- 
2.34.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 1/5] sched/fair: Add push task framework
  2025-04-09 11:15 [RFC PATCH 0/5] K Prateek Nayak
@ 2025-04-09 11:15 ` K Prateek Nayak
  2025-04-10 10:00   ` Peter Zijlstra
  2025-04-09 11:15 ` [RFC PATCH 2/5] sched/fair: Introduce overloaded_mask in sched_domain_shared K Prateek Nayak
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-09 11:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal,
	K Prateek Nayak

From: Vincent Guittot <vincent.guittot@linaro.org>

Add the skeleton for push task infrastructure. The empty
push_fair_task() prototype will be used to implement proactive idle
balancing in subsequent commits.

  [ prateek: Broke off relevant bits from [1] ]

Link: https://lore.kernel.org/all/20250302210539.1563190-6-vincent.guittot@linaro.org/ [1]
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c  | 85 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |  2 ++
 2 files changed, 87 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c19459c8042..98d3ed2078cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7044,6 +7044,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 }
 
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
 static void set_next_buddy(struct sched_entity *se);
 
 /*
@@ -7074,6 +7075,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		h_nr_idle = task_has_idle_policy(p);
 		if (task_sleep || task_delayed || !se->sched_delayed)
 			h_nr_runnable = 1;
+
+		fair_remove_pushable_task(rq, p);
 	} else {
 		cfs_rq = group_cfs_rq(se);
 		slice = cfs_rq_min_slice(cfs_rq);
@@ -8556,6 +8559,64 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return target;
 }
 
+static inline bool fair_push_task(struct task_struct *p)
+{
+	if (!task_on_rq_queued(p))
+		return false;
+
+	if (p->se.sched_delayed)
+		return false;
+
+	if (p->nr_cpus_allowed == 1)
+		return false;
+
+	return true;
+}
+
+static inline int has_pushable_tasks(struct rq *rq)
+{
+	return !plist_head_empty(&rq->cfs.pushable_tasks);
+}
+
+/*
+ * See if the non-running fair tasks on this rq can be sent to other CPUs
+ * that fit better with their profile.
+ */
+static bool push_fair_task(struct rq *rq)
+{
+	return false;
+}
+
+static void push_fair_tasks(struct rq *rq)
+{
+	/* push_fair_task() will return true if it moved a fair task */
+	while (push_fair_task(rq))
+		;
+}
+
+static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
+
+static inline void fair_queue_pushable_tasks(struct rq *rq)
+{
+	if (!has_pushable_tasks(rq))
+		return;
+
+	queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
+}
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
+{
+	plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+}
+
+static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
+{
+	if (fair_push_task(p)) {
+		plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+		plist_node_init(&p->pushable_tasks, p->prio);
+		plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+	}
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8725,6 +8786,9 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return sched_balance_newidle(rq, rf) != 0;
 }
 #else
+static inline void fair_queue_pushable_tasks(struct rq *rq) {}
+static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p) {}
+static inline void fair_add_pushable_task(struct rq *rq, struct task_struct *p) {}
 static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
 #endif /* CONFIG_SMP */
 
@@ -8914,6 +8978,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 		put_prev_entity(cfs_rq, pse);
 		set_next_entity(cfs_rq, se);
 
+		/*
+		 * The previous task might be eligible for being pushed to
+		 * another CPU if it is still active.
+		 */
+		fair_add_pushable_task(rq, prev);
+
 		__set_next_task_fair(rq, p, true);
 	}
 
@@ -8986,6 +9056,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	/*
+	 * The previous task might be eligible for being pushed to another CPU
+	 * if it is still active.
+	 */
+	fair_add_pushable_task(rq, prev);
+
 }
 
 /*
@@ -13260,6 +13337,8 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
 {
 	struct sched_entity *se = &p->se;
 
+	fair_remove_pushable_task(rq, p);
+
 #ifdef CONFIG_SMP
 	if (task_on_rq_queued(p)) {
 		/*
@@ -13277,6 +13356,11 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
 	if (hrtick_enabled_fair(rq))
 		hrtick_start_fair(rq, p);
 
+	/*
+	 * Try to push the prev task before checking misfit for the next
+	 * task as the migration of prev can make next fit the CPU.
+	 */
+	fair_queue_pushable_tasks(rq);
 	update_misfit_status(p, rq);
 	sched_fair_update_stop_tick(rq, p);
 }
@@ -13307,6 +13391,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
 #ifdef CONFIG_SMP
+	plist_head_init(&cfs_rq->pushable_tasks);
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5a6a503eb6d..aa92c0d75de7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -716,6 +716,8 @@ struct cfs_rq {
 	struct list_head	leaf_cfs_rq_list;
 	struct task_group	*tg;	/* group that "owns" this runqueue */
 
+	struct plist_head	pushable_tasks;
+
 	/* Locally cached copy of our task_group's idle value */
 	int			idle;
 
-- 
2.34.1



* [RFC PATCH 2/5] sched/fair: Introduce overloaded_mask in sched_domain_shared
  2025-04-09 11:15 [RFC PATCH 0/5] K Prateek Nayak
  2025-04-09 11:15 ` [RFC PATCH 1/5] sched/fair: Add push task framework K Prateek Nayak
@ 2025-04-09 11:15 ` K Prateek Nayak
  2025-04-10 10:03   ` Peter Zijlstra
  2025-04-09 11:15 ` [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task K Prateek Nayak
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-09 11:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal,
	K Prateek Nayak

Introduce a new cpumask member "overloaded_mask" in sched_domain_shared.
This mask will be used to keep track of overloaded CPUs with pushable
tasks on them and will later be used by newidle balance to only scan
through the overloaded CPUs to pull a task to the newly idle CPU.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/topology.c        | 25 ++++++++++++++++++-------
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7b4301b7235f..2fc3794fd719 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -78,6 +78,7 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+	cpumask_var_t	overloaded_mask;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bbc2fc2c7c22..6b1ef953b571 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -638,8 +638,10 @@ static void destroy_sched_domain(struct sched_domain *sd)
 	 */
 	free_sched_groups(sd->groups, 1);
 
-	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
+	if (sd->shared && atomic_dec_and_test(&sd->shared->ref)) {
+		free_cpumask_var(sd->shared->overloaded_mask);
 		kfree(sd->shared);
+	}
 	kfree(sd);
 }
 
@@ -2239,27 +2241,31 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			return -ENOMEM;
 
 		for_each_cpu(j, cpu_map) {
+			int node = cpu_to_node(j);
 			struct sched_domain *sd;
 			struct sched_domain_shared *sds;
 			struct sched_group *sg;
 			struct sched_group_capacity *sgc;
 
 			sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
-					GFP_KERNEL, cpu_to_node(j));
+					GFP_KERNEL, node);
 			if (!sd)
 				return -ENOMEM;
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
 			sds = kzalloc_node(sizeof(struct sched_domain_shared),
-					GFP_KERNEL, cpu_to_node(j));
+					GFP_KERNEL, node);
 			if (!sds)
 				return -ENOMEM;
 
+			if (!zalloc_cpumask_var_node(&sds->overloaded_mask, GFP_KERNEL, node))
+				return -ENOMEM;
+
 			*per_cpu_ptr(sdd->sds, j) = sds;
 
 			sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
-					GFP_KERNEL, cpu_to_node(j));
+					GFP_KERNEL, node);
 			if (!sg)
 				return -ENOMEM;
 
@@ -2268,7 +2274,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			*per_cpu_ptr(sdd->sg, j) = sg;
 
 			sgc = kzalloc_node(sizeof(struct sched_group_capacity) + cpumask_size(),
-					GFP_KERNEL, cpu_to_node(j));
+					GFP_KERNEL, node);
 			if (!sgc)
 				return -ENOMEM;
 
@@ -2299,8 +2305,13 @@ static void __sdt_free(const struct cpumask *cpu_map)
 				kfree(*per_cpu_ptr(sdd->sd, j));
 			}
 
-			if (sdd->sds)
-				kfree(*per_cpu_ptr(sdd->sds, j));
+			if (sdd->sds) {
+				struct sched_domain_shared *sds = *per_cpu_ptr(sdd->sds, j);
+
+				if (sds)
+					free_cpumask_var(sds->overloaded_mask);
+				kfree(sds);
+			}
 			if (sdd->sg)
 				kfree(*per_cpu_ptr(sdd->sg, j));
 			if (sdd->sgc)
-- 
2.34.1



* [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task
  2025-04-09 11:15 [RFC PATCH 0/5] K Prateek Nayak
  2025-04-09 11:15 ` [RFC PATCH 1/5] sched/fair: Add push task framework K Prateek Nayak
  2025-04-09 11:15 ` [RFC PATCH 2/5] sched/fair: Introduce overloaded_mask in sched_domain_shared K Prateek Nayak
@ 2025-04-09 11:15 ` K Prateek Nayak
  2025-04-14  2:28   ` Aaron Lu
  2025-04-21  5:20   ` Shrikanth Hegde
  2025-04-09 11:15 ` [RFC PATCH 4/5] sched/fair: Rework inter-NUMA newidle balancing K Prateek Nayak
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-09 11:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal,
	K Prateek Nayak

In the presence of pushable tasks on a CPU, set it on the newly
introduced "overloaded_mask" in the sched_domain_shared struct. This
will be used by newidle balance to limit the scanning to these
overloaded CPUs since they contain tasks that could be run on the newly
idle target.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98d3ed2078cd..834fcdd15cac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8559,6 +8559,24 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return target;
 }
 
+static inline void update_overloaded_mask(int cpu, bool contains_pushable)
+{
+	struct sched_domain_shared *sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	cpumask_var_t overloaded_mask;
+
+	if (!sd_share)
+		return;
+
+	overloaded_mask = sd_share->overloaded_mask;
+	if (!overloaded_mask)
+		return;
+
+	if (contains_pushable)
+		cpumask_set_cpu(cpu, overloaded_mask);
+	else
+		cpumask_clear_cpu(cpu, overloaded_mask);
+}
+
 static inline bool fair_push_task(struct task_struct *p)
 {
 	if (!task_on_rq_queued(p))
@@ -8606,11 +8624,17 @@ static inline void fair_queue_pushable_tasks(struct rq *rq)
 static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
 {
 	plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+	if (!has_pushable_tasks(rq))
+		update_overloaded_mask(rq->cpu, false);
 }
 
 static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
 {
 	if (fair_push_task(p)) {
+		if (!has_pushable_tasks(rq))
+			update_overloaded_mask(rq->cpu, true);
+
 		plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
 		plist_node_init(&p->pushable_tasks, p->prio);
 		plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
-- 
2.34.1



* [RFC PATCH 4/5] sched/fair: Rework inter-NUMA newidle balancing
  2025-04-09 11:15 [RFC PATCH 0/5] K Prateek Nayak
                   ` (2 preceding siblings ...)
  2025-04-09 11:15 ` [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task K Prateek Nayak
@ 2025-04-09 11:15 ` K Prateek Nayak
  2025-04-10 10:14   ` Peter Zijlstra
  2025-04-09 11:15 ` [RFC PATCH 5/5] sched/fair: Proactive idle balance using push mechanism K Prateek Nayak
  2025-04-09 11:17 ` [RFC PATCH 0/5] sched/fair: Idle and newidle balancing " K Prateek Nayak
  5 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-09 11:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal,
	K Prateek Nayak

With the introduction of "overloaded_mask" in sched_domain_shared
struct, it is now possible to scan through the CPUs that contain
pushable tasks that could be run on the CPU going newly idle.

Redesign the inter-NUMA newidle balancing to opportunistically pull a
task to the CPU going idle from the overloaded CPUs only.

The search starts from sd_llc and moves up until sd_numa. Since
"overloaded_mask" is per-LLC, each LLC domain is visited individually
using the per-CPU sd_llc struct shared by all CPUs in an LLC.

Once one LLC has been visited, all the CPUs in it are marked visited
and the search resumes with the LLCs of the CPUs that remain.

detach_one_task() is used instead of pick_next_pushable_fair_task()
since detach_one_task() also considers the CPU affinity of the task
being pulled, as opposed to pick_next_pushable_fair_task(), which
returns the first pushable task.

Since each iteration over the overloaded_mask rechecks the idle state
of the CPU doing the newidle balance, the initial gating factor based
on "rq->avg_idle" has been removed.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 129 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 117 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 834fcdd15cac..93f180b67899 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12864,6 +12864,100 @@ static inline bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle
 static inline void nohz_newidle_balance(struct rq *this_rq) { }
 #endif /* CONFIG_NO_HZ_COMMON */
 
+static inline bool sched_newidle_continue_balance(struct rq *rq)
+{
+	return !rq->nr_running && !rq->ttwu_pending;
+}
+
+static inline int sched_newidle_pull_overloaded(struct sched_domain *sd,
+						struct rq *this_rq,
+						int *continue_balancing)
+{
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+	int cpu, this_cpu = cpu_of(this_rq);
+	struct sched_domain *sd_parent;
+	struct lb_env env = {
+		.dst_cpu	= this_cpu,
+		.dst_rq		= this_rq,
+		.idle		= CPU_NEWLY_IDLE,
+	};
+
+
+	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+
+next_domain:
+	env.sd = sd;
+	/* Allow migrating cache_hot tasks too. */
+	sd->nr_balance_failed = sd->cache_nice_tries + 1;
+
+	for_each_cpu_wrap(cpu, cpus, this_cpu) {
+		struct sched_domain_shared *sd_share;
+		struct cpumask *overloaded_mask;
+		struct sched_domain *cpu_llc;
+		int overloaded_cpu;
+
+		cpu_llc = rcu_dereference(per_cpu(sd_llc, cpu));
+		if (!cpu_llc)
+			break;
+
+		sd_share = cpu_llc->shared;
+		if (!sd_share)
+			break;
+
+		overloaded_mask = sd_share->overloaded_mask;
+		if (!overloaded_mask)
+			break;
+
+		for_each_cpu_wrap(overloaded_cpu, overloaded_mask, this_cpu + 1) {
+			struct rq *overloaded_rq = cpu_rq(overloaded_cpu);
+			struct task_struct *p = NULL;
+
+			if (sched_newidle_continue_balance(this_rq)) {
+				*continue_balancing = 0;
+				return 0;
+			}
+
+			/* Quick peek to find if pushable tasks exist. */
+			if (!has_pushable_tasks(overloaded_rq))
+				continue;
+
+			scoped_guard (rq_lock, overloaded_rq) {
+				update_rq_clock(overloaded_rq);
+
+				if (!has_pushable_tasks(overloaded_rq))
+					break;
+
+				env.src_cpu = overloaded_cpu;
+				env.src_rq = overloaded_rq;
+
+				p = detach_one_task(&env);
+			}
+
+			if (!p)
+				continue;
+
+			attach_one_task(this_rq, p);
+			return 1;
+		}
+
+		cpumask_andnot(cpus, cpus, sched_domain_span(cpu_llc));
+	}
+
+	if (sched_newidle_continue_balance(this_rq)) {
+		*continue_balancing = 0;
+		return 0;
+	}
+
+	sd_parent = sd->parent;
+	if (sd_parent && !(sd_parent->flags & SD_NUMA)) {
+		cpumask_andnot(cpus, sched_domain_span(sd_parent), sched_domain_span(sd));
+		sd = sd_parent;
+		goto next_domain;
+	}
+
+	return 0;
+}
+
 /*
  * sched_balance_newidle is called by schedule() if this_cpu is about to become
  * idle. Attempts to pull tasks from other CPUs.
@@ -12881,6 +12975,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
 	int pulled_task = 0;
+	u64 domain_cost;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -12913,28 +13008,34 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	rq_unpin_lock(this_rq, rf);
 
 	rcu_read_lock();
-	sd = rcu_dereference_check_sched_domain(this_rq->sd);
-
-	if (!get_rd_overloaded(this_rq->rd) ||
-	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
-
-		if (sd)
-			update_next_balance(sd, &next_balance);
+	if (!get_rd_overloaded(this_rq->rd)) {
 		rcu_read_unlock();
-
 		goto out;
 	}
 	rcu_read_unlock();
 
 	raw_spin_rq_unlock(this_rq);
 
+	rcu_read_lock();
 	t0 = sched_clock_cpu(this_cpu);
-	sched_balance_update_blocked_averages(this_cpu);
 
-	rcu_read_lock();
-	for_each_domain(this_cpu, sd) {
-		u64 domain_cost;
+	sd = rcu_dereference(per_cpu(sd_llc, this_cpu));
+	if (sd) {
+		pulled_task = sched_newidle_pull_overloaded(sd, this_rq, &continue_balancing);
+
+		t1 = sched_clock_cpu(this_cpu);
+		domain_cost = t1 - t0;
+		curr_cost += domain_cost;
+		t0 = t1;
 
+		if (pulled_task || !continue_balancing)
+			goto skip_numa;
+	}
+
+	sched_balance_update_blocked_averages(this_cpu);
+
+	sd = rcu_dereference(per_cpu(sd_numa, this_cpu));
+	while (sd) {
 		update_next_balance(sd, &next_balance);
 
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
@@ -12960,7 +13061,11 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 		 */
 		if (pulled_task || !continue_balancing)
 			break;
+
+		sd = sd->parent;
 	}
+
+skip_numa:
 	rcu_read_unlock();
 
 	raw_spin_rq_lock(this_rq);
-- 
2.34.1



* [RFC PATCH 5/5] sched/fair: Proactive idle balance using push mechanism
  2025-04-09 11:15 [RFC PATCH 0/5] K Prateek Nayak
                   ` (3 preceding siblings ...)
  2025-04-09 11:15 ` [RFC PATCH 4/5] sched/fair: Rework inter-NUMA newidle balancing K Prateek Nayak
@ 2025-04-09 11:15 ` K Prateek Nayak
  2025-04-10 10:29   ` Peter Zijlstra
  2025-04-09 11:17 ` [RFC PATCH 0/5] sched/fair: Idle and newidle balancing " K Prateek Nayak
  5 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-09 11:15 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal,
	K Prateek Nayak

Proactively try to push tasks to one of the CPUs set in the
"nohz.idle_cpus_mask" from the push callback.

pick_next_pushable_fair_task() is taken from Vincent's series [1] as
is, but the locking rules in push_fair_task() have been relaxed to
release the local rq lock after dequeuing the task and reacquire it
after pushing the task to the idle target.

The double_lock_balance() used in RT seems necessary to maintain strict
priority ordering; however, that may not be necessary for fair tasks.

Link: https://lore.kernel.org/all/20250302210539.1563190-6-vincent.guittot@linaro.org/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 93f180b67899..b2b316e75ad0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8596,12 +8596,71 @@ static inline int has_pushable_tasks(struct rq *rq)
 	return !plist_head_empty(&rq->cfs.pushable_tasks);
 }
 
+static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_tasks(rq))
+		return NULL;
+
+	p = plist_first_entry(&rq->cfs.pushable_tasks,
+			      struct task_struct, pushable_tasks);
+
+	WARN_ON_ONCE(rq->cpu != task_cpu(p));
+	WARN_ON_ONCE(task_current(rq, p));
+	WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+	WARN_ON_ONCE(!task_on_rq_queued(p));
+
+	/*
+	 * Remove task from the pushable list as we try only once after that
+	 * the task has been put back in enqueued list.
+	 */
+	plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+	return p;
+}
+
+static void fair_add_pushable_task(struct rq *rq, struct task_struct *p);
+static void attach_one_task(struct rq *rq, struct task_struct *p);
+
 /*
 * See if the non-running fair tasks on this rq can be sent to other CPUs
 * that fit better with their profile.
  */
 static bool push_fair_task(struct rq *rq)
 {
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+	struct task_struct *p = pick_next_pushable_fair_task(rq);
+	int cpu, this_cpu = cpu_of(rq);
+
+	if (!p)
+		return false;
+
+	if (!cpumask_and(cpus, nohz.idle_cpus_mask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)))
+		goto requeue;
+
+	if (!cpumask_and(cpus, cpus, p->cpus_ptr))
+		goto requeue;
+
+	for_each_cpu_wrap(cpu, cpus, this_cpu + 1) {
+		struct rq *target_rq;
+
+		if (!idle_cpu(cpu))
+			continue;
+
+		target_rq = cpu_rq(cpu);
+		deactivate_task(rq, p, 0);
+		set_task_cpu(p, cpu);
+		raw_spin_rq_unlock(rq);
+
+		attach_one_task(target_rq, p);
+		raw_spin_rq_lock(rq);
+
+		return true;
+	}
+
+requeue:
+	fair_add_pushable_task(rq, p);
 	return false;
 }
 
-- 
2.34.1



* Re: [RFC PATCH 0/5] sched/fair: Idle and newidle balancing using push mechanism.
  2025-04-09 11:15 [RFC PATCH 0/5] K Prateek Nayak
                   ` (4 preceding siblings ...)
  2025-04-09 11:15 ` [RFC PATCH 5/5] sched/fair: Proactive idle balance using push mechanism K Prateek Nayak
@ 2025-04-09 11:17 ` K Prateek Nayak
  5 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-09 11:17 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

Missed the most important part ...

Subject: sched/fair: Idle and newidle balancing using push mechanism.

-- 
Thanks and Regards,
Prateek

On 4/9/2025 4:45 PM, K Prateek Nayak wrote:
> Hello everyone,
> 
> There was some interest at OSPM'25 to explore using the push task
> mechanism for idle and newidle balance. This series implements one such
> idea. The main reason for the RFC is to understand if this is the
> implementation people were in favor of before trying to optimize it for
> all the workloads from my test setup.
> 
> Note: The current performance of the prototype is rough. I haven't
> optimized it yet since I would love some feedback first on the approach.
> 
> 
> Current approach
> ================
> 
> The push task framework for the fair class has been cherry-picked from
> Vincent's series [1] and has been implemented for the !EAS case.
> 
> This series implements the idea from Valentin [2] where, in the presence
> of pushable tasks, the CPU will set itself on a per-LLC "overloaded_mask".
> 
> The inter-NUMA newidle balance has been modified to traverse the CPUs
> set on the overloaded mask, first in the local LLC, and then the CPUs
> set on the overloaded masks of other LLCs in the same NUMA node, with
> the goal of pulling a single task towards itself rather than performing
> a full-fledged load balance.
> 
> This implements some of the ideas from David Vernet's SHARED_RUNQ
> prototype [3] except that, instead of a single SHARED_RUNQ per LLC /
> per shard, the overloaded mask serves as an indicator of the per-CPU
> rq(s) containing pushable tasks that can be migrated to the CPU going
> idle. This avoids a per-SHARED_RUNQ lock at the expense of maintaining
> the overloaded cpumask.
> 
> The push callback itself has been modified to try to push the tasks on
> the pushable task list to one of the CPUs on the "nohz.idle_cpus_mask",
> taking the load off of idle balancing.
> 
> 
> Clarification required
> ======================
> 
> I believe using the per-CPU pushable task list as a proxy for a single
> SHARED_RUNQ was the idea Peter was implying during the discussion. Is
> this correct or did I completely misunderstand it? P.S. SHARED_RUNQ
> could also be modelled as a large per-LLC push list.
> 
> An alternate implementation is to allow CPUs to go idle as quickly as
> possible and then rely completely on the push mechanism and the
> "nohz.idle_cpus_mask" to push tasks to an idle CPU; however, this puts
> the burden of moving tasks on a busy, overloaded CPU, which may not be
> ideal.
> 
> Since folks mentioned using "push mechanism" for newidle balance, was
> the above idea the one they had in mind?
> 
> There still seems to be some clear advantage to doing a complete
> balance in the newidle path. Since schedstats are not rigged up yet
> for the new approach, I'm not completely sure where the advantages and
> disadvantages currently lie.
> 
> If the current approach is the right one, I'll dig deeper to try to
> address all the shortcomings of this prototype.
> 
> Systems with a unified LLC will likely run into bottlenecks maintaining
> a large per-LLC mask that can see multiple concurrent updates. I have
> plans to implement an "sd_shard" which shards the large LLC, making the
> cpumask maintenance less heavy on these systems.
> 
> 
> References
> ==========
> 
> [1] https://lore.kernel.org/lkml/20250302210539.1563190-6-vincent.guittot@linaro.org/
> [2] https://lore.kernel.org/lkml/xhsmh1putoxbz.mognet@vschneid-thinkpadt14sgen2i.remote.csb/
> [3] https://lore.kernel.org/lkml/20231212003141.216236-1-void@manifault.com/
> 
> --
> K Prateek Nayak (4):
>    sched/fair: Introduce overloaded_mask in sched_domain_shared
>    sched/fair: Update overloaded mask in presence of pushable task
>    sched/fair: Rework inter-NUMA newidle balancing
>    sched/fair: Proactive idle balance using push mechanism
> 
> Vincent Guittot (1):
>    sched/fair: Add push task framework
> 
>   include/linux/sched/topology.h |   1 +
>   kernel/sched/fair.c            | 297 +++++++++++++++++++++++++++++++--
>   kernel/sched/sched.h           |   2 +
>   kernel/sched/topology.c        |  25 ++-
>   4 files changed, 306 insertions(+), 19 deletions(-)
> 
> 
> base-commit: 6432e163ba1b7d80b5876792ce53e511f041ab91



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/5] sched/fair: Add push task framework
  2025-04-09 11:15 ` [RFC PATCH 1/5] sched/fair: Add push task framework K Prateek Nayak
@ 2025-04-10 10:00   ` Peter Zijlstra
  2025-04-10 15:25     ` K Prateek Nayak
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-04-10 10:00 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

On Wed, Apr 09, 2025 at 11:15:35AM +0000, K Prateek Nayak wrote:

> +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
> +{
> +	if (fair_push_task(p)) {
> +		plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +		plist_node_init(&p->pushable_tasks, p->prio);

I gotta ask, why do we care about ordering the push list on p->prio for
fair?

> +		plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +	}
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,

> @@ -8914,6 +8978,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>  		put_prev_entity(cfs_rq, pse);
>  		set_next_entity(cfs_rq, se);
>  
> +		/*
> +		 * The previous task might be eligible for being pushed on
> +		 * another cpu if it is still active.
> +		 */
> +		fair_add_pushable_task(rq, prev);
> +
>  		__set_next_task_fair(rq, p, true);
>  	}
>  
> @@ -8986,6 +9056,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
>  		cfs_rq = cfs_rq_of(se);
>  		put_prev_entity(cfs_rq, se);
>  	}
> +
> +	/*
> +	 * The previous task might be eligible for being pushed on another cpu
> +	 * if it is still active.
> +	 */
> +	fair_add_pushable_task(rq, prev);
> +
>  }

These two are tricky; while they will work with a push balance callback,
they will cause some pain with pulling from the push lists; a-la
newidle.

Notably, we must be sure to check ->on_cpu.

Perhaps later patches add this, let me continue reading...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/5] sched/fair: Introduce overloaded_mask in sched_domain_shared
  2025-04-09 11:15 ` [RFC PATCH 2/5] sched/fair: Introduce overloaded_mask in sched_domain_shared K Prateek Nayak
@ 2025-04-10 10:03   ` Peter Zijlstra
  2025-04-10 15:26     ` K Prateek Nayak
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-04-10 10:03 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

On Wed, Apr 09, 2025 at 11:15:36AM +0000, K Prateek Nayak wrote:
> Introduce a new cpumask member "overloaded_mask" in sched_domain_shared.
> This mask will be used to keep track of overloaded CPUs with pushable
> tasks on them and will be later used by newidle balance to only scan
> through the overloaded CPUs to pull a task to it.

'to pull a task from it.' ?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Rework inter-NUMA newidle balancing
  2025-04-09 11:15 ` [RFC PATCH 4/5] sched/fair: Rework inter-NUMA newidle balancing K Prateek Nayak
@ 2025-04-10 10:14   ` Peter Zijlstra
  2025-04-10 15:27     ` K Prateek Nayak
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-04-10 10:14 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

On Wed, Apr 09, 2025 at 11:15:38AM +0000, K Prateek Nayak wrote:
> +static inline int sched_newidle_pull_overloaded(struct sched_domain *sd,
> +						struct rq *this_rq,
> +						int *continue_balancing)
> +{
> +	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> +	int cpu, this_cpu = cpu_of(this_rq);
> +	struct sched_domain *sd_parent;
> +	struct lb_env env = {
> +		.dst_cpu	= this_cpu,
> +		.dst_rq		= this_rq,
> +		.idle		= CPU_NEWLY_IDLE,
> +	};
> +
> +
> +	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
> +
> +next_domain:
> +	env.sd = sd;
> +	/* Allow migrating cache_hot tasks too. */
> +	sd->nr_balance_failed = sd->cache_nice_tries + 1;
> +
> +	for_each_cpu_wrap(cpu, cpus, this_cpu) {
> +		struct sched_domain_shared *sd_share;
> +		struct cpumask *overloaded_mask;
> +		struct sched_domain *cpu_llc;
> +		int overloaded_cpu;
> +
> +		cpu_llc = rcu_dereference(per_cpu(sd_llc, cpu));
> +		if (!cpu_llc)
> +			break;
> +
> +		sd_share = cpu_llc->shared;
> +		if (!sd_share)
> +			break;
> +
> +		overloaded_mask = sd_share->overloaded_mask;
> +		if (!overloaded_mask)
> +			break;
> +
> +		for_each_cpu_wrap(overloaded_cpu, overloaded_mask, this_cpu + 1) {
> +			struct rq *overloaded_rq = cpu_rq(overloaded_cpu);
> +			struct task_struct *p = NULL;
> +
> +			if (sched_newidle_continue_balance(this_rq)) {
> +				*continue_balancing = 0;
> +				return 0;
> +			}
> +
> +			/* Quick peek to find if pushable tasks exist. */
> +			if (!has_pushable_tasks(overloaded_rq))
> +				continue;
> +
> +			scoped_guard (rq_lock, overloaded_rq) {
> +				update_rq_clock(overloaded_rq);
> +
> +				if (!has_pushable_tasks(overloaded_rq))
> +					break;

You can skip the clock update if there aren't any tasks to grab.

> +
> +				env.src_cpu = overloaded_cpu;
> +				env.src_rq = overloaded_rq;
> +
> +				p = detach_one_task(&env);

Yep, detach_one_task() uses can_migrate_task() which checks
task_on_cpu(), so that's all good :-)

> +			}
> +
> +			if (!p)
> +				continue;
> +
> +			attach_one_task(this_rq, p);
> +			return 1;
> +		}
> +
> +		cpumask_andnot(cpus, cpus, sched_domain_span(cpu_llc));
> +	}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 5/5] sched/fair: Proactive idle balance using push mechanism
  2025-04-09 11:15 ` [RFC PATCH 5/5] sched/fair: Proactive idle balance using push mechanism K Prateek Nayak
@ 2025-04-10 10:29   ` Peter Zijlstra
  2025-04-10 15:37     ` K Prateek Nayak
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-04-10 10:29 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

On Wed, Apr 09, 2025 at 11:15:39AM +0000, K Prateek Nayak wrote:
> Proactively try to push tasks to one of the CPUs set in the
> "nohz.idle_cpus_mask" from the push callback.
> 
> pick_next_pushable_fair_task() is taken from Vincent's series [1] as is,
> but the locking rules in push_fair_task() have been relaxed to release
> the local rq lock after dequeuing the task and reacquire it after
> pushing it to the idle target.
> 
> double_lock_balance() used in RT seems necessary to maintain strict
> priority ordering; however, that may not be necessary for fair tasks.

Agreed; don't use double_lock_balance() if you can at all avoid it. It
is quite terrible.


>  /*
>   * See if the non-running fair tasks on this rq can be sent to other CPUs
>   * that fit better with their profile.
>   */
>  static bool push_fair_task(struct rq *rq)
>  {
> +	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> +	struct task_struct *p = pick_next_pushable_fair_task(rq);
> +	int cpu, this_cpu = cpu_of(rq);
> +
> +	if (!p)
> +		return false;
> +
> +	if (!cpumask_and(cpus, nohz.idle_cpus_mask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)))
> +		goto requeue;

So I think the main goal here should be to get rid of the whole single
nohz balancing thing.

This global state/mask has been shown to be a problem over and over again.

Ideally we keep a nohz idle mask per LLC (right next to the overload
mask you introduced earlier), along with a bit in the sched_domain tree
upwards of that to indicate a particular llc / node / distance-group has
nohz idle CPUs.

Then if the topmost domain has the bit set it means there are nohz cpus
to be found, and we can (slowly) iterate the domain tree up from
overloaded LLC to push tasks around.
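Roughly, something like this user-space sketch (flat arrays and names are purely illustrative; the real state would hang off sched_domain_shared and the domain tree): a per-LLC nohz idle mask plus a summary bit per LLC, so a pusher only walks LLCs known to contain idle CPUs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for per-LLC state. */
#define NR_LLCS		4
#define CPUS_PER_LLC	16

struct llc_state {
	uint64_t nohz_idle_mask;	/* CPUs in this LLC that are nohz idle */
};

static struct llc_state llcs[NR_LLCS];
static uint64_t llc_has_nohz;		/* summary: one bit per LLC */

static void set_cpu_nohz_idle(int llc, int cpu, bool idle)
{
	if (idle) {
		llcs[llc].nohz_idle_mask |= 1ULL << cpu;
		llc_has_nohz |= 1ULL << llc;
	} else {
		llcs[llc].nohz_idle_mask &= ~(1ULL << cpu);
		/* Clear the summary bit only when the whole LLC is busy. */
		if (!llcs[llc].nohz_idle_mask)
			llc_has_nohz &= ~(1ULL << llc);
	}
}

/* A pusher skips entire LLCs whose summary bit is clear. */
static int find_nohz_idle_cpu(void)
{
	for (int llc = 0; llc < NR_LLCS; llc++) {
		if (!(llc_has_nohz & (1ULL << llc)))
			continue;
		for (int cpu = 0; cpu < CPUS_PER_LLC; cpu++)
			if (llcs[llc].nohz_idle_mask & (1ULL << cpu))
				return llc * CPUS_PER_LLC + cpu;
	}
	return -1;
}
```

The summary bits are what let the overloaded CPU iterate outward from its own LLC without ever touching a single global mask.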

Anyway, yes, you gotta start somewhere :-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 1/5] sched/fair: Add push task framework
  2025-04-10 10:00   ` Peter Zijlstra
@ 2025-04-10 15:25     ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-10 15:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

Hello Peter,

Thank you for the review.

On 4/10/2025 3:30 PM, Peter Zijlstra wrote:
> On Wed, Apr 09, 2025 at 11:15:35AM +0000, K Prateek Nayak wrote:
> 
>> +static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
>> +{
>> +	if (fair_push_task(p)) {
>> +		plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
>> +		plist_node_init(&p->pushable_tasks, p->prio);
> 
> I gotta ask, why do we care about ordering the push list on p->prio for
> fair?

This bit was carried over from Vincent's series. I assume it was
inspired by the RT, DL's push mechanism. Perhaps fair can get away with
MRU ordering similar to rq->cfs_tasks?
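Something like the sketch below, where plain user-space C stands in for the kernel's list primitives (all names are illustrative): a task that just ran moves to the head, and the pusher picks the coldest task from the tail, so no priority ordering is needed at all.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive doubly-linked list, similar in spirit to list_head. */
struct node { struct node *prev, *next; };

static void list_init(struct node *h) { h->prev = h->next = h; }

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

static void list_add_head(struct node *h, struct node *n)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

struct task { int id; struct node link; };

/* MRU: a task that just ran is moved to the head of the push list ... */
static void task_ran(struct node *head, struct task *t)
{
	list_del(&t->link);
	list_add_head(head, &t->link);
}

/* ... and the pusher picks from the tail, i.e. the most cache-cold task. */
static struct task *pick_push_candidate(struct node *head)
{
	if (head->prev == head)
		return NULL;
	return (struct task *)((char *)head->prev - offsetof(struct task, link));
}
```

That would trade the plist's O(prio) insert for O(1) moves, at the cost of losing any priority bias when picking a migration candidate.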

> 
>> +		plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
>> +	}
>> +}
>> +
>>   /*
>>    * select_task_rq_fair: Select target runqueue for the waking task in domains
>>    * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
> 
>> @@ -8914,6 +8978,12 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
>>   		put_prev_entity(cfs_rq, pse);
>>   		set_next_entity(cfs_rq, se);
>>   
>> +		/*
>> +		 * The previous task might be eligible for being pushed on
>> +		 * another cpu if it is still active.
>> +		 */
>> +		fair_add_pushable_task(rq, prev);
>> +
>>   		__set_next_task_fair(rq, p, true);

So __set_next_task_fair() removes the current task from the pushable list
when it is picked ...

>>   	}
>>   
>> @@ -8986,6 +9056,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct t
>>   		cfs_rq = cfs_rq_of(se);
>>   		put_prev_entity(cfs_rq, se);
>>   	}
>> +
>> +	/*
>> +	 * The previous task might be eligible for being pushed on another cpu
>> +	 * if it is still active.
>> +	 */
>> +	fair_add_pushable_task(rq, prev);
>> +
>>   }
> 
> These two are tricky; while they will work with a push balance callback,
> they will cause some pain with pulling from the push lists; a-la
> newidle.
> 
> Notably, we must be sure to check ->on_cpu.

... so we should be safe here. Other considerations are covered by
detach_task(), but the pushable task list should only contain queued
tasks that can run on more than one CPU, and Vincent was cautious enough
to place a bunch of WARN_ON_ONCE() in pick_next_pushable_fair_task() to
catch those cases :)

> 
> Perhaps later patches add this, let me continue reading...

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/5] sched/fair: Introduce overloaded_mask in sched_domain_shared
  2025-04-10 10:03   ` Peter Zijlstra
@ 2025-04-10 15:26     ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-10 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

Hello Peter,

On 4/10/2025 3:33 PM, Peter Zijlstra wrote:
> On Wed, Apr 09, 2025 at 11:15:36AM +0000, K Prateek Nayak wrote:
>> Introduce a new cpumask member "overloaded_mask" in sched_domain_shared.
>> This mask will be used to keep track of overloaded CPUs with pushable
>> tasks on them and will be later used by newidle balance to only scan
>> through the overloaded CPUs to pull a task to it.
> 
> 'to pull a task from it.' ?

Right! Sorry about that.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 4/5] sched/fair: Rework inter-NUMA newidle balancing
  2025-04-10 10:14   ` Peter Zijlstra
@ 2025-04-10 15:27     ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-10 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

On 4/10/2025 3:44 PM, Peter Zijlstra wrote:

[..snip..]

>> +				continue;
>> +
>> +			scoped_guard (rq_lock, overloaded_rq) {
>> +				update_rq_clock(overloaded_rq);
>> +
>> +				if (!has_pushable_tasks(overloaded_rq))
>> +					break;
> 
> You can skip the clock update if there aren't any tasks to grab.

Ack.

> 
>> +
>> +				env.src_cpu = overloaded_cpu;
>> +				env.src_rq = overloaded_rq;
>> +
>> +				p = detach_one_task(&env);
> 
> Yep, detach_one_task() uses can_migrate_task() which checks
> task_on_cpu(), so that's all good :-)
> 
>> +			}
>> +
>> +			if (!p)
>> +				continue;
>> +
>> +			attach_one_task(this_rq, p);
>> +			return 1;
>> +		}
>> +
>> +		cpumask_andnot(cpus, cpus, sched_domain_span(cpu_llc));
>> +	}

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 5/5] sched/fair: Proactive idle balance using push mechanism
  2025-04-10 10:29   ` Peter Zijlstra
@ 2025-04-10 15:37     ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-10 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

On 4/10/2025 3:59 PM, Peter Zijlstra wrote:

[..snip..]

>>   /*
>>    * See if the non-running fair tasks on this rq can be sent to other CPUs
>>    * that fit better with their profile.
>>    */
>>   static bool push_fair_task(struct rq *rq)
>>   {
>> +	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
>> +	struct task_struct *p = pick_next_pushable_fair_task(rq);
>> +	int cpu, this_cpu = cpu_of(rq);
>> +
>> +	if (!p)
>> +		return false;
>> +
>> +	if (!cpumask_and(cpus, nohz.idle_cpus_mask, housekeeping_cpumask(HK_TYPE_KERNEL_NOISE)))
>> +		goto requeue;
> 
> So I think the main goal here should be to get rid of the whole single
> nohz balancing thing.
> 
> This global state/mask has been shown to be a problem over and over again.
> 
> Ideally we keep a nohz idle mask per LLC (right next to the overload
> mask you introduced earlier), along with a bit in the sched_domain tree
> upwards of that to indicate a particular llc/ node / distance-group has
> nohz idle.
> 
> Then if the topmost domain has the bit set it means there are nohz cpus
> to be found, and we can (slowly) iterate the domain tree up from
> overloaded LLC to push tasks around.

I'll go through fair.c to understand all the use cases of
"nohz.idle_cpus_mask" and then start with this bit for v2 to see if it
blows up in some way. I'll be back shortly.

> 
> Anyway, yes, you gotta start somewhere :-)

Thanks a ton for the initial review. I'll go analyze more to see what
bits are making benchmarks go sad.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task
  2025-04-09 11:15 ` [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task K Prateek Nayak
@ 2025-04-14  2:28   ` Aaron Lu
  2025-04-14  3:13     ` K Prateek Nayak
  2025-04-21  5:20   ` Shrikanth Hegde
  1 sibling, 1 reply; 21+ messages in thread
From: Aaron Lu @ 2025-04-14  2:28 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

Hi Prateek,

On Wed, Apr 09, 2025 at 11:15:37AM +0000, K Prateek Nayak wrote:
> In presence of pushable tasks on the CPU, set it on the newly introduced
> "overloaded_mask" in sched_domain_shared struct. This will be used by
> the newidle balance to limit the scanning to these overloaded CPUs since
> they contain tasks that could be run on the newly idle target.
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 98d3ed2078cd..834fcdd15cac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8559,6 +8559,24 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  	return target;
>  }
>  
> +static inline void update_overloaded_mask(int cpu, bool contains_pushable)
> +{
> +	struct sched_domain_shared *sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));

I got a suspicious RCU usage warning for this line while testing your
series. Since rq lock is held here, rcu_dereference() should not be
necessary.

> +	cpumask_var_t overloaded_mask;
> +
> +	if (!sd_share)
> +		return;
> +
> +	overloaded_mask = sd_share->overloaded_mask;
> +	if (!overloaded_mask)
> +		return;
> +
> +	if (contains_pushable)
> +		cpumask_set_cpu(cpu, overloaded_mask);
> +	else
> +		cpumask_clear_cpu(cpu, overloaded_mask);
> +}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task
  2025-04-14  2:28   ` Aaron Lu
@ 2025-04-14  3:13     ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-14  3:13 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

Hello Aaron,

On 4/14/2025 7:58 AM, Aaron Lu wrote:
> Hi Prateek,
> 
> On Wed, Apr 09, 2025 at 11:15:37AM +0000, K Prateek Nayak wrote:
>> In presence of pushable tasks on the CPU, set it on the newly introduced
>> "overloaded+mask" in sched_domain_shared struct. This will be used by
>> the newidle balance to limit the scanning to these overloaded CPUs since
>> they contain tasks that could be run on the newly idle target.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>>   kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>>   1 file changed, 24 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 98d3ed2078cd..834fcdd15cac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8559,6 +8559,24 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>   	return target;
>>   }
>>   
>> +static inline void update_overloaded_mask(int cpu, bool contains_pushable)
>> +{
>> +	struct sched_domain_shared *sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> 
> I got a suspicious RCU usage warning for this line while testing your
> series. Since rq lock is held here, rcu_dereference() should not be
> necessary.

Thank you for reporting this. I'll make sure to run with LOCKDEP next
time around. Note: the performance of this series is still quite bad,
and the intent of the RFC was to vet the idea and to understand if I got
the basic implementation details right.

> 
>> +	cpumask_var_t overloaded_mask;
>> +
>> +	if (!sd_share)
>> +		return;
>> +
>> +	overloaded_mask = sd_share->overloaded_mask;
>> +	if (!overloaded_mask)
>> +		return;
>> +
>> +	if (contains_pushable)
>> +		cpumask_set_cpu(cpu, overloaded_mask);
>> +	else
>> +		cpumask_clear_cpu(cpu, overloaded_mask);
>> +}

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task
  2025-04-09 11:15 ` [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task K Prateek Nayak
  2025-04-14  2:28   ` Aaron Lu
@ 2025-04-21  5:20   ` Shrikanth Hegde
  2025-04-21  5:54     ` K Prateek Nayak
  1 sibling, 1 reply; 21+ messages in thread
From: Shrikanth Hegde @ 2025-04-21  5:20 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal



On 4/9/25 16:45, K Prateek Nayak wrote:

Hi Prateek. Feel free to cc me in future versions.
This seems an interesting approach if it can get rid of newidle balance overheads.

> In presence of pushable tasks on the CPU, set it on the newly introduced
> "overloaded_mask" in sched_domain_shared struct. This will be used by
> the newidle balance to limit the scanning to these overloaded CPUs since
> they contain tasks that could be run on the newly idle target.
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>   kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>   1 file changed, 24 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 98d3ed2078cd..834fcdd15cac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8559,6 +8559,24 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   	return target;
>   }
>   
> +static inline void update_overloaded_mask(int cpu, bool contains_pushable)
> +{
> +	struct sched_domain_shared *sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> +	cpumask_var_t overloaded_mask;
> +
> +	if (!sd_share)
> +		return;
> +
> +	overloaded_mask = sd_share->overloaded_mask;
> +	if (!overloaded_mask)
> +		return;
> +
> +	if (contains_pushable)
> +		cpumask_set_cpu(cpu, overloaded_mask);
> +	else
> +		cpumask_clear_cpu(cpu, overloaded_mask);
> +}
> +

I was getting the below error when compiling. Noticed that overloaded_mask
is a local copy, so updating it wouldn't update the actual overloaded_mask.

Compilation Error:
kernel/sched/fair.c: In function ‘update_overloaded_mask’:
kernel/sched/fair.c:8570:25: error: assignment to expression with array type
  8570 |         overloaded_mask = sd_share->overloaded_mask;
       |                         ^
kernel/sched/fair.c:8571:13: warning: the address of ‘overloaded_mask’ will always evaluate as ‘true’ [-Waddress]
  8571 |         if (!overloaded_mask)



Made the below change. Also, you would need RCU protection for sd_share, I think.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bca3db5d0ac0..818d4769ec55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8562,19 +8562,18 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
  static inline void update_overloaded_mask(int cpu, bool contains_pushable)
  {
         struct sched_domain_shared *sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-       cpumask_var_t overloaded_mask;
  
         if (!sd_share)
                 return;
  
-       overloaded_mask = sd_share->overloaded_mask;
-       if (!overloaded_mask)
+       if (!sd_share->overloaded_mask)
                 return;
  
+       guard(rcu)();
         if (contains_pushable)
-               cpumask_set_cpu(cpu, overloaded_mask);
+               cpumask_set_cpu(cpu, sd_share->overloaded_mask);
         else
-               cpumask_clear_cpu(cpu, overloaded_mask);
+               cpumask_clear_cpu(cpu, sd_share->overloaded_mask);
  }

>   static inline bool fair_push_task(struct task_struct *p)
>   {
>   	if (!task_on_rq_queued(p))
> @@ -8606,11 +8624,17 @@ static inline void fair_queue_pushable_tasks(struct rq *rq)
>   static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
>   {
>   	plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> +
> +	if (!has_pushable_tasks(rq))
> +		update_overloaded_mask(rq->cpu, false);
>   }
>   
>   static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
>   {
>   	if (fair_push_task(p)) {
> +		if (!has_pushable_tasks(rq))
> +			update_overloaded_mask(rq->cpu, true);
> +
>   		plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
>   		plist_node_init(&p->pushable_tasks, p->prio);
>   		plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task
  2025-04-21  5:20   ` Shrikanth Hegde
@ 2025-04-21  5:54     ` K Prateek Nayak
  2025-04-21  8:03       ` Shrikanth Hegde
  0 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-21  5:54 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

Hello Shrikanth,

On 4/21/2025 10:50 AM, Shrikanth Hegde wrote:
> 
> 
> On 4/9/25 16:45, K Prateek Nayak wrote:
> 
> Hi Prateek. Feel free to cc me in the future versions.


Will do! Thank you for taking a look at the series.

> This seems an interesting approach if it can get rid of newidle balance overheads.
> 
>> In presence of pushable tasks on the CPU, set it on the newly introduced
>> "overloaded_mask" in sched_domain_shared struct. This will be used by
>> the newidle balance to limit the scanning to these overloaded CPUs since
>> they contain tasks that could be run on the newly idle target.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>>   kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>>   1 file changed, 24 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 98d3ed2078cd..834fcdd15cac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8559,6 +8559,24 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>>       return target;
>>   }
>> +static inline void update_overloaded_mask(int cpu, bool contains_pushable)
>> +{
>> +    struct sched_domain_shared *sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
>> +    cpumask_var_t overloaded_mask;
>> +
>> +    if (!sd_share)
>> +        return;
>> +
>> +    overloaded_mask = sd_share->overloaded_mask;
>> +    if (!overloaded_mask)
>> +        return;
>> +
>> +    if (contains_pushable)
>> +        cpumask_set_cpu(cpu, overloaded_mask);
>> +    else
>> +        cpumask_clear_cpu(cpu, overloaded_mask);
>> +}
>> +
> 
> I was getting the below error when compiling. Noticed that overloaded_mask
> is a local copy, so updating it wouldn't update the actual overloaded_mask.

Interesting! Question: Do you have "CONFIG_CPUMASK_OFFSTACK" enabled in
your config? (me makes a note to test this too in the next iteration)
Looking at the arch specific Kconfigs, there is a slight difference in
how this is toggled on x86 vs powerpc and I'm wondering if that is why
I haven't seen this warning in my testing.

> 
> Compilation Error:
> kernel/sched/fair.c: In function ‘update_overloaded_mask’:
> kernel/sched/fair.c:8570:25: error: assignment to expression with array type
>   8570 |         overloaded_mask = sd_share->overloaded_mask;
>        |                         ^
> kernel/sched/fair.c:8571:13: warning: the address of ‘overloaded_mask’ will always evaluate as ‘true’ [-Waddress]
>   8571 |         if (!overloaded_mask)
> 
> 
> 
> Made the below change. Also, you would need RCU protection for sd_share, I think.

You are right! Thank you for the pointers and spotting my oversight.
Aaron too pointed some shortcomings here. I'll make sure to test
these bits more next time (LOCKDEP, hotplug, and
!CONFIG_CPUMASK_OFFSTACK)

-- 
Thanks and Regards,
Prateek

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bca3db5d0ac0..818d4769ec55 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8562,19 +8562,18 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>   static inline void update_overloaded_mask(int cpu, bool contains_pushable)
>   {
>          struct sched_domain_shared *sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> -       cpumask_var_t overloaded_mask;
> 
>          if (!sd_share)
>                  return;
> 
> -       overloaded_mask = sd_share->overloaded_mask;
> -       if (!overloaded_mask)
> +       if (!sd_share->overloaded_mask)
>                  return;
> 
> +       guard(rcu)();
>          if (contains_pushable)
> -               cpumask_set_cpu(cpu, overloaded_mask);
> +               cpumask_set_cpu(cpu, sd_share->overloaded_mask);
>          else
> -               cpumask_clear_cpu(cpu, overloaded_mask);
> +               cpumask_clear_cpu(cpu, sd_share->overloaded_mask);
>   }
> 
>>   static inline bool fair_push_task(struct task_struct *p)
>>   {
>>       if (!task_on_rq_queued(p))
>> @@ -8606,11 +8624,17 @@ static inline void fair_queue_pushable_tasks(struct rq *rq)
>>   static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p)
>>   {
>>       plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
>> +
>> +    if (!has_pushable_tasks(rq))
>> +        update_overloaded_mask(rq->cpu, false);
>>   }
>>   static void fair_add_pushable_task(struct rq *rq, struct task_struct *p)
>>   {
>>       if (fair_push_task(p)) {
>> +        if (!has_pushable_tasks(rq))
>> +            update_overloaded_mask(rq->cpu, true);
>> +
>>           plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
>>           plist_node_init(&p->pushable_tasks, p->prio);
>>           plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task
  2025-04-21  5:54     ` K Prateek Nayak
@ 2025-04-21  8:03       ` Shrikanth Hegde
  2025-04-21  8:54         ` K Prateek Nayak
  0 siblings, 1 reply; 21+ messages in thread
From: Shrikanth Hegde @ 2025-04-21  8:03 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal



On 4/21/25 11:24, K Prateek Nayak wrote:
> Hello Shrikanth,
> 
> On 4/21/2025 10:50 AM, Shrikanth Hegde wrote:
>>
>>
>> On 4/9/25 16:45, K Prateek Nayak wrote:
>>
>> Hi Prateek. Feel free to cc me in the future versions.
> 
> 
> Will do! Thank you for taking a look at the series.
> 
>> This seems interesting way if it can get us rid of newidle balance 
>> overheads.
>>
>>> In presence of pushable tasks on the CPU, set it on the newly introduced
>>> "overloaded_mask" in sched_domain_shared struct. This will be used by
>>> the newidle balance to limit the scanning to these overloaded CPUs since
>>> they contain tasks that could be run on the newly idle target.
>>>
>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> ---
>>>   kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>>>   1 file changed, 24 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 98d3ed2078cd..834fcdd15cac 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8559,6 +8559,24 @@ static int find_energy_efficient_cpu(struct 
>>> task_struct *p, int prev_cpu)
>>>       return target;
>>>   }
>>> +static inline void update_overloaded_mask(int cpu, bool 
>>> contains_pushable)
>>> +{
>>> +    struct sched_domain_shared *sd_share = 
>>> rcu_dereference(per_cpu(sd_llc_shared, cpu));
>>> +    cpumask_var_t overloaded_mask;
>>> +
>>> +    if (!sd_share)
>>> +        return;
>>> +
>>> +    overloaded_mask = sd_share->overloaded_mask;
>>> +    if (!overloaded_mask)
>>> +        return;
>>> +
>>> +    if (contains_pushable)
>>> +        cpumask_set_cpu(cpu, overloaded_mask);
>>> +    else
>>> +        cpumask_clear_cpu(cpu, overloaded_mask);
>>> +}
>>> +
>
>> I was getting the below error when compiling. Noticed that overloaded_mask 
>> is a local copy, so updating it wouldn't
>> update the actual overloaded_mask.
> 
> Interesting! Question: Do you have "CONFIG_CPUMASK_OFFSTACK" enabled in
> your config? (me makes a note to test this too in the next iteration)
> Looking at the arch specific Kconfigs, there is a slight difference in
> how this is toggled on x86 vs powerpc and I'm wondering if that is why
> I haven't seen this warning in my testing.
> 

Yes, that's the reason you didn't run into it.
For me, CONFIG_CPUMASK_OFFSTACK is not set.

>>
>> Compilation Error:
>> kernel/sched/fair.c: In function ‘update_overloaded_mask’:
>> kernel/sched/fair.c:8570:25: error: assignment to expression with 
>> array type
>>   8570 |         overloaded_mask = sd_share->overloaded_mask;
>>        |                         ^
>> kernel/sched/fair.c:8571:13: warning: the address of ‘overloaded_mask’ 
>> will always evaluate as ‘true’ [-Waddress]
>>   8571 |         if (!overloaded_mask)
>>
>>
>>
>> Made the below change. Also, I think you would need RCU protection 
>> for sd_share.

Or you need to use it like below. This also works (not tested on x86):

struct cpumask *overloaded_mask;

> 
> You are right! Thank you for the pointers and spotting my oversight.
> Aaron too pointed some shortcomings here. I'll make sure to test
> these bits more next time (LOCKDEP, hotplug, and
> !CONFIG_CPUMASK_OFFSTACK)
> 



* Re: [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task
  2025-04-21  8:03       ` Shrikanth Hegde
@ 2025-04-21  8:54         ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2025-04-21  8:54 UTC (permalink / raw)
  To: Shrikanth Hegde
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Gautham R. Shenoy, Swapnil Sapkal

Hello Shrikanth,

On 4/21/2025 1:33 PM, Shrikanth Hegde wrote:
>>> I was getting the below error when compiling. Noticed that overloaded_mask is a local copy, so updating it wouldn't
>>> update the actual overloaded_mask.
>>
>> Interesting! Question: Do you have "CONFIG_CPUMASK_OFFSTACK" enabled in
>> your config? (me makes a note to test this too in the next iteration)
>> Looking at the arch specific Kconfigs, there is a slight difference in
>> how this is toggled on x86 vs powerpc and I'm wondering if that is why
>> I haven't seen this warning in my testing.
>>
> 
> Yes, that's the reason you didn't run into it.
> For me, CONFIG_CPUMASK_OFFSTACK is not set.

Thank you for confirming!

> 
>>>
>>> Compilation Error:
>>> kernel/sched/fair.c: In function ‘update_overloaded_mask’:
>>> kernel/sched/fair.c:8570:25: error: assignment to expression with array type
>>>   8570 |         overloaded_mask = sd_share->overloaded_mask;
>>>        |                         ^
>>> kernel/sched/fair.c:8571:13: warning: the address of ‘overloaded_mask’ will always evaluate as ‘true’ [-Waddress]
>>>   8571 |         if (!overloaded_mask)
>>>
>>>
>>>
>>> Made the below change. Also, I think you would need RCU protection for sd_share.
> 
> Or you need to use it like below. This also works (not tested on x86):
> 
> struct cpumask *overloaded_mask;

Ack! Perhaps that is the way to go. I'll take some inspiration from
other cpumask usage in the kernel and adjust this accordingly. Thanks a
ton for testing.

> 
>>
>> You are right! Thank you for the pointers and spotting my oversight.
>> Aaron too pointed some shortcomings here. I'll make sure to test
>> these bits more next time (LOCKDEP, hotplug, and
>> !CONFIG_CPUMASK_OFFSTACK)

-- 
Thanks and Regards,
Prateek



end of thread, other threads:[~2025-04-21  8:54 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-09 11:15 [RFC PATCH 0/5] K Prateek Nayak
2025-04-09 11:15 ` [RFC PATCH 1/5] sched/fair: Add push task framework K Prateek Nayak
2025-04-10 10:00   ` Peter Zijlstra
2025-04-10 15:25     ` K Prateek Nayak
2025-04-09 11:15 ` [RFC PATCH 2/5] sched/fair: Introduce overloaded_mask in sched_domain_shared K Prateek Nayak
2025-04-10 10:03   ` Peter Zijlstra
2025-04-10 15:26     ` K Prateek Nayak
2025-04-09 11:15 ` [RFC PATCH 3/5] sched/fair: Update overloaded mask in presence of pushable task K Prateek Nayak
2025-04-14  2:28   ` Aaron Lu
2025-04-14  3:13     ` K Prateek Nayak
2025-04-21  5:20   ` Shrikanth Hegde
2025-04-21  5:54     ` K Prateek Nayak
2025-04-21  8:03       ` Shrikanth Hegde
2025-04-21  8:54         ` K Prateek Nayak
2025-04-09 11:15 ` [RFC PATCH 4/5] sched/fair: Rework inter-NUMA newidle balancing K Prateek Nayak
2025-04-10 10:14   ` Peter Zijlstra
2025-04-10 15:27     ` K Prateek Nayak
2025-04-09 11:15 ` [RFC PATCH 5/5] sched/fair: Proactive idle balance using push mechanism K Prateek Nayak
2025-04-10 10:29   ` Peter Zijlstra
2025-04-10 15:37     ` K Prateek Nayak
2025-04-09 11:17 ` [RFC PATCH 0/5] sched/fair: Idle and newidle balancing " K Prateek Nayak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox