public inbox for linux-kernel@vger.kernel.org
* [RFC][PATCH 1/2] sched: proxy-exec: Fix tasks being left unpushable from proxy_tag_curr()
@ 2026-03-04  6:38 John Stultz
  2026-03-04  6:38 ` [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr John Stultz
  0 siblings, 1 reply; 9+ messages in thread
From: John Stultz @ 2026-03-04  6:38 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Johannes Weiner, Steven Rostedt, Ben Segall, Mel Gorman,
	Joel Fernandes, Qais Yousef, Xuewen Yan, Suleiman Souhlal,
	kuyo chang, hupu, zhidao su, soolaugust, kernel-team

With proxy-execution, when we are running a lock owner on behalf
of a waiting donor, we call proxy_tag_curr() to ensure that both
the selected donor task *and* the owner we will run are taken
off the pushable lists. This avoids crashes where the running
task gets migrated away because only the donor was removed from
the pushable list.

However, K Prateek noticed that while we take the task off the
pushable list, we don't actually return it to the pushable
list when we are done running it as a proxy on behalf of a donor
task.

Fix this in __schedule() by calling proxy_tag_curr() again on
the previously run task once we have switched the rq->curr
value. This triggers another dequeue/enqueue cycle, which lets
the sched class re-evaluate whether the task should be added
back to the pushable list.

Further optimizations to get rid of the dequeue/enqueue calls
for something more focused will follow.

NOTE: K Prateek also suggested that we should re-evaluate
whether balance callbacks should be set when we allow prev to be
re-added to the pushable list, as pick_next_task() might have
seen no pushable tasks. I think this is a good idea, but I've
not yet addressed this.

Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: zhidao su <suzhidao@xiaomi.com>
Cc: soolaugust@gmail.com
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6e0..55bafb1585eca 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6760,7 +6760,7 @@ static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
  */
 static void __sched notrace __schedule(int sched_mode)
 {
-	struct task_struct *prev, *next;
+	struct task_struct *prev, *next, *prev_donor;
 	/*
 	 * On PREEMPT_RT kernel, SM_RTLOCK_WAIT is noted
 	 * as a preemption by schedule_debug() and RCU.
@@ -6779,7 +6779,7 @@ static void __sched notrace __schedule(int sched_mode)
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
-
+	prev_donor = rq->donor;
 	schedule_debug(prev, preempt);
 
 	if (sched_feat(HRTICK) || sched_feat(HRTICK_DL))
@@ -6873,6 +6873,8 @@ static void __sched notrace __schedule(int sched_mode)
 
 		if (!task_current_donor(rq, next))
 			proxy_tag_curr(rq, next);
+		if (!(!preempt && prev_state) && prev != prev_donor)
+			proxy_tag_curr(rq, prev);
 
 		/*
 		 * The membarrier system call requires each architecture
-- 
2.53.0.473.g4a7958ca14-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr
  2026-03-04  6:38 [RFC][PATCH 1/2] sched: proxy-exec: Fix tasks being left unpushable from proxy_tag_curr() John Stultz
@ 2026-03-04  6:38 ` John Stultz
  2026-03-04 13:18   ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: John Stultz @ 2026-03-04  6:38 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, zhidao su, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Johannes Weiner, Steven Rostedt, Ben Segall, Mel Gorman,
	Joel Fernandes, Qais Yousef, Xuewen Yan, K Prateek Nayak,
	Suleiman Souhlal, kuyo chang, hupu, soolaugust, kernel-team

Currently proxy_tag_curr() calls the task dequeue and enqueue
functions on lock owners we want to run on behalf of a waiting
donor, in order to force the owner task we are going to run
off any sched_class pushable lists. This avoids crashes that
could happen if the task being run ended up being migrated to
another CPU.

The dequeue/enqueue pair is something of an ugly hack, though,
so replace it with more focused prevent_migration and
allow_migration function pointers: the former keeps the owner
from being migrated while we run it, and the latter makes it
migratable again once it has finished running on behalf of the
blocked donor.

This patch was inspired by discussion around a similar RFC
patch by zhidao su <suzhidao@xiaomi.com>, which highlighted the
inefficiency of the dequeue/enqueue pair:
 https://lore.kernel.org/lkml/20260303115718.278608-1-soolaugust@gmail.com/

Reported-by: zhidao su <suzhidao@xiaomi.com>
Closes: https://lore.kernel.org/lkml/20260303115718.278608-1-soolaugust@gmail.com/
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: zhidao su <suzhidao@xiaomi.com>
Cc: soolaugust@gmail.com
Cc: kernel-team@android.com
---
 kernel/sched/core.c     | 18 +++++++++++++-----
 kernel/sched/deadline.c | 34 ++++++++++++++++++++++++++--------
 kernel/sched/rt.c       | 28 +++++++++++++++++++++++-----
 kernel/sched/sched.h    |  3 +++
 4 files changed, 65 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 55bafb1585eca..174a3177a3a6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6712,11 +6712,19 @@ static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
 	 * However, the chosen/donor task *and* the mutex owner form an
 	 * atomic pair wrt push/pull.
 	 *
-	 * Make sure owner we run is not pushable. Unfortunately we can
-	 * only deal with that by means of a dequeue/enqueue cycle. :-/
+	 * Make sure owner we run is not pushable.
 	 */
-	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
-	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+	if (owner->sched_class->prevent_migration)
+		owner->sched_class->prevent_migration(rq, owner);
+}
+
+static inline void proxy_untag_prev(struct rq *rq, struct task_struct *prev)
+{
+	if (!sched_proxy_exec())
+		return;
+
+	if (prev->sched_class->allow_migration)
+		prev->sched_class->allow_migration(rq, prev);
 }
 
 /*
@@ -6874,7 +6882,7 @@ static void __sched notrace __schedule(int sched_mode)
 		if (!task_current_donor(rq, next))
 			proxy_tag_curr(rq, next);
 		if (!(!preempt && prev_state) && prev != prev_donor)
-			proxy_tag_curr(rq, prev);
+			proxy_untag_prev(rq, prev);
 
 		/*
 		 * The membarrier system call requires each architecture
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b004293234..6bd6f0682e6c6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2289,6 +2289,28 @@ static void dequeue_dl_entity(struct sched_dl_entity *dl_se, int flags)
 		task_non_contending(dl_se, true);
 }
 
+static inline void __allow_migration_dl(struct rq *rq, struct task_struct *p)
+{
+	if (dl_server(&p->dl))
+		return;
+
+	if (task_is_blocked(p))
+		return;
+
+	if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
+		enqueue_pushable_dl_task(rq, p);
+}
+
+static void allow_migration_dl(struct rq *rq, struct task_struct *p)
+{
+	__allow_migration_dl(rq, p);
+}
+
+static void prevent_migration_dl(struct rq *rq, struct task_struct *p)
+{
+	dequeue_pushable_dl_task(rq, p);
+}
+
 static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (is_dl_boosted(&p->dl)) {
@@ -2339,14 +2361,7 @@ static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_dl_entity(&p->dl, flags);
 
-	if (dl_server(&p->dl))
-		return;
-
-	if (task_is_blocked(p))
-		return;
-
-	if (!task_current(rq, p) && !p->dl.dl_throttled && p->nr_cpus_allowed > 1)
-		enqueue_pushable_dl_task(rq, p);
+	__allow_migration_dl(rq, p);
 }
 
 static bool dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
@@ -3408,6 +3423,9 @@ DEFINE_SCHED_CLASS(dl) = {
 	.dequeue_task		= dequeue_task_dl,
 	.yield_task		= yield_task_dl,
 
+	.allow_migration	= allow_migration_dl,
+	.prevent_migration	= prevent_migration_dl,
+
 	.wakeup_preempt		= wakeup_preempt_dl,
 
 	.pick_task		= pick_task_dl,
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f69e1f16d9238..90f1c62e1f827 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1424,6 +1424,25 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 	enqueue_top_rt_rq(&rq->rt);
 }
 
+static void __allow_migration_rt(struct rq *rq, struct task_struct *p)
+{
+	if (task_is_blocked(p))
+		return;
+
+	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
+		enqueue_pushable_task(rq, p);
+}
+
+static void allow_migration_rt(struct rq *rq, struct task_struct *p)
+{
+	__allow_migration_rt(rq, p);
+}
+
+static void prevent_migration_rt(struct rq *rq, struct task_struct *p)
+{
+	dequeue_pushable_task(rq, p);
+}
+
 /*
  * Adding/removing a task to/from a priority array:
  */
@@ -1440,11 +1459,7 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
 
 	enqueue_rt_entity(rt_se, flags);
 
-	if (task_is_blocked(p))
-		return;
-
-	if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
-		enqueue_pushable_task(rq, p);
+	__allow_migration_rt(rq, p);
 }
 
 static bool dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -2583,6 +2598,9 @@ DEFINE_SCHED_CLASS(rt) = {
 	.dequeue_task		= dequeue_task_rt,
 	.yield_task		= yield_task_rt,
 
+	.allow_migration	= allow_migration_rt,
+	.prevent_migration	= prevent_migration_rt,
+
 	.wakeup_preempt		= wakeup_preempt_rt,
 
 	.pick_task		= pick_task_rt,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 43bbf0693cca4..5c3eb8b28ebd3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2575,6 +2575,9 @@ struct sched_class {
 	 */
 	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
+	void (*allow_migration)(struct rq *rq, struct task_struct *p);
+	void (*prevent_migration)(struct rq *rq, struct task_struct *p);
+
 	/*
 	 * ttwu_do_activate: rq->lock
 	 * wake_up_new_task: task_rq_lock
-- 
2.53.0.473.g4a7958ca14-goog



* Re: [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr
  2026-03-04  6:38 ` [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr John Stultz
@ 2026-03-04 13:18   ` Peter Zijlstra
  2026-03-05  4:02     ` K Prateek Nayak
  2026-03-05  7:31     ` [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr John Stultz
  0 siblings, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2026-03-04 13:18 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, zhidao su, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Johannes Weiner,
	Steven Rostedt, Ben Segall, Mel Gorman, Joel Fernandes,
	Qais Yousef, Xuewen Yan, K Prateek Nayak, Suleiman Souhlal,
	kuyo chang, hupu, soolaugust, kernel-team

On Wed, Mar 04, 2026 at 06:38:10AM +0000, John Stultz wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 55bafb1585eca..174a3177a3a6b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6712,11 +6712,19 @@ static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
>  	 * However, the chosen/donor task *and* the mutex owner form an
>  	 * atomic pair wrt push/pull.
>  	 *
> -	 * Make sure owner we run is not pushable. Unfortunately we can
> -	 * only deal with that by means of a dequeue/enqueue cycle. :-/
> +	 * Make sure owner we run is not pushable.
>  	 */
> -	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> -	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> +	if (owner->sched_class->prevent_migration)
> +		owner->sched_class->prevent_migration(rq, owner);
> +}
> +
> +static inline void proxy_untag_prev(struct rq *rq, struct task_struct *prev)
> +{
> +	if (!sched_proxy_exec())
> +		return;
> +
> +	if (prev->sched_class->allow_migration)
> +		prev->sched_class->allow_migration(rq, prev);
>  }
>  
>  /*
> @@ -6874,7 +6882,7 @@ static void __sched notrace __schedule(int sched_mode)
>  		if (!task_current_donor(rq, next))
>  			proxy_tag_curr(rq, next);
>  		if (!(!preempt && prev_state) && prev != prev_donor)
> -			proxy_tag_curr(rq, prev);
> +			proxy_untag_prev(rq, prev);
>  
>  		/*
>  		 * The membarrier system call requires each architecture

Yeah, not a fan in this form.

I really don't think we need new class callbacks for this. Esp. not
named like this, which is quite terrible.

Note how migrate_disable() and migrate_enable() use ->set_cpus_allowed()
and are both very much about preventing and allowing migration.

Also note how set_next_task() / put_prev_task() already very much do
what you want; except they only work for the donor.

Further note that the only reason this proxy_tag_curr() thing lives
where it does is because it depends on the value of current. However if
you do this, you no longer have that constraint and then there is a much
saner place for all this.


So I think I prefer (ab)using the migrate_disable() infrastructure,
simply because it would avoid having to do an (indirect) class call
entirely -- but looking at how RT/DL handle this, I think there's bugs
there.

Specifically, something like pick_next_pushable_task() should never
return something that has ->migration_disabled set, it should continue
iterating the list until it finds one that hasn't.


Anyway, without having tested anything at all, how crazy would something
like this be?


---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b571e640372..79b606e5d7cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2413,6 +2413,10 @@ static void migrate_disable_switch(struct rq *rq, struct task_struct *p)
 	if (likely(!p->migration_disabled))
 		return;
 
+	if ((p->migration_flags & MDF_PROXY) &&
+	    p->migration_disabled == 1)
+		return;
+
 	if (p->cpus_ptr != &p->cpus_mask)
 		return;
 
@@ -6651,11 +6655,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	struct mutex *mutex;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; task_is_blocked(p); p = owner) {
-		mutex = p->blocked_on;
-		/* Something changed in the chain, so pick again */
-		if (!mutex)
-			return NULL;
+	for (p = donor; (mutex = p->blocked_on); p = owner) {
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
 		 * and ensure @owner sticks around.
@@ -6756,21 +6756,18 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 }
 #endif /* SCHED_PROXY_EXEC */
 
-static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
+static inline void set_proxy_task(struct task_struct *p)
 {
-	if (!sched_proxy_exec())
-		return;
-	/*
-	 * pick_next_task() calls set_next_task() on the chosen task
-	 * at some point, which ensures it is not push/pullable.
-	 * However, the chosen/donor task *and* the mutex owner form an
-	 * atomic pair wrt push/pull.
-	 *
-	 * Make sure owner we run is not pushable. Unfortunately we can
-	 * only deal with that by means of a dequeue/enqueue cycle. :-/
-	 */
-	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
-	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
+	WARN_ON_ONCE(p->migration_flags & MDF_PROXY);
+	p->migration_flags |= MDF_PROXY;
+	p->migration_disabled++;
+}
+
+static inline void put_proxy_task(struct task_struct *p)
+{
+	WARN_ON_ONCE(!(p->migration_flags & MDF_PROXY));
+	p->migration_flags &= ~MDF_PROXY;
+	p->migration_disabled--;
 }
 
 /*
@@ -6900,14 +6897,22 @@ static void __sched notrace __schedule(int sched_mode)
 
 pick_again:
 	next = pick_next_task(rq, rq->donor, &rf);
-	rq_set_donor(rq, next);
 	rq->next_class = next->sched_class;
-	if (unlikely(task_is_blocked(next))) {
-		next = find_proxy_task(rq, next, &rf);
-		if (!next)
-			goto pick_again;
-		if (next == rq->idle)
-			goto keep_resched;
+	if (sched_proxy_exec()) {
+		if (prev != rq->donor)
+			put_proxy_task(prev);
+
+		rq_set_donor(rq, next);
+		if (next->blocked_on) {
+			next = find_proxy_task(rq, next, &rf);
+			if (!next)
+				goto pick_again;
+			if (next == rq->idle)
+				goto keep_resched;
+		}
+
+		if (next != rq->donor)
+			set_proxy_task(next);
 	}
 picked:
 	clear_tsk_need_resched(prev);
@@ -6924,9 +6929,6 @@ static void __sched notrace __schedule(int sched_mode)
 		 */
 		RCU_INIT_POINTER(rq->curr, next);
 
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -6960,10 +6962,6 @@ static void __sched notrace __schedule(int sched_mode)
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
-		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq, NULL);
 		hrtick_schedule_exit(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fd36ae390520..8222e108be73 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1368,6 +1368,7 @@ static inline int cpu_of(struct rq *rq)
 }
 
 #define MDF_PUSH		0x01
+#define MDF_PROXY		0x02
 
 static inline bool is_migration_disabled(struct task_struct *p)
 {


* Re: [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr
  2026-03-04 13:18   ` Peter Zijlstra
@ 2026-03-05  4:02     ` K Prateek Nayak
  2026-03-05 14:46       ` Peter Zijlstra
  2026-03-05  7:31     ` [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr John Stultz
  1 sibling, 1 reply; 9+ messages in thread
From: K Prateek Nayak @ 2026-03-05  4:02 UTC (permalink / raw)
  To: Peter Zijlstra, John Stultz
  Cc: LKML, zhidao su, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Johannes Weiner,
	Steven Rostedt, Ben Segall, Mel Gorman, Joel Fernandes,
	Qais Yousef, Xuewen Yan, Suleiman Souhlal, kuyo chang, hupu,
	soolaugust, kernel-team

Hello Peter,

On 3/4/2026 6:48 PM, Peter Zijlstra wrote:
> +static inline void set_proxy_task(struct task_struct *p)
>  {
> -	if (!sched_proxy_exec())
> -		return;
> -	/*
> -	 * pick_next_task() calls set_next_task() on the chosen task
> -	 * at some point, which ensures it is not push/pullable.
> -	 * However, the chosen/donor task *and* the mutex owner form an
> -	 * atomic pair wrt push/pull.
> -	 *
> -	 * Make sure owner we run is not pushable. Unfortunately we can
> -	 * only deal with that by means of a dequeue/enqueue cycle. :-/
> -	 */
> -	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> -	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> +	WARN_ON_ONCE(p->migration_flags & MDF_PROXY);
> +	p->migration_flags |= MDF_PROXY;
> +	p->migration_disabled++;
> +}
> +
> +static inline void put_proxy_task(struct task_struct *p)
> +{
> +	WARN_ON_ONCE(!(p->migration_flags & MDF_PROXY));
> +	p->migration_flags &= ~MDF_PROXY;
> +	p->migration_disabled--;

Note: I'm not too familiar with the set_affinity bits so my
understanding might be all wrong but ...

Doesn't the set_affinity bits have a completion based wait for tasks
that are migration disabled? If we have a case like:

       P0                                             P1
       ==                                             ==

  migrate_disable()
    p->migration_disabled++; /* 1 */
  ...
  mutex_lock()
  ...

  /* preempted */
  /* proxied */
  set_proxy_task(P0)
    p->migration_disabled++; /* 2 */
  ...
  /* Continues running */
                                        set_cpus_allowed_ptr(P0)
                                          /* Task CPU not in the new mask. */
                                          affine_move_task()
                                            /*
                                             * blocks as per the comment
                                             * above affine_move_task().
                                             */
  migrate_enable()
   if (p->migration_disabled > 1)
     p->migration_disabled--; /* 1 */
     return;
  ...
  mutex_unlock();
  /* Goes into schedule. */
  put_proxy_task(P0)
    p->migration_disabled--; /* 0 */

  /* !!! Who does the migration + wakeup? !!! */


Isn't it up to the last migrate_enable() (or in this case,
put_proxy_task()) to schedule in the stopper and push the prev to
another CPU? Or is it handled in some other way?

>  }

-- 
Thanks and Regards,
Prateek



* Re: [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr
  2026-03-04 13:18   ` Peter Zijlstra
  2026-03-05  4:02     ` K Prateek Nayak
@ 2026-03-05  7:31     ` John Stultz
  1 sibling, 0 replies; 9+ messages in thread
From: John Stultz @ 2026-03-05  7:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, zhidao su, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Johannes Weiner,
	Steven Rostedt, Ben Segall, Mel Gorman, Joel Fernandes,
	Qais Yousef, Xuewen Yan, K Prateek Nayak, Suleiman Souhlal,
	kuyo chang, hupu, soolaugust, kernel-team

On Wed, Mar 4, 2026 at 5:18 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 04, 2026 at 06:38:10AM +0000, John Stultz wrote:
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 55bafb1585eca..174a3177a3a6b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6712,11 +6712,19 @@ static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
> >        * However, the chosen/donor task *and* the mutex owner form an
> >        * atomic pair wrt push/pull.
> >        *
> > -      * Make sure owner we run is not pushable. Unfortunately we can
> > -      * only deal with that by means of a dequeue/enqueue cycle. :-/
> > +      * Make sure owner we run is not pushable.
> >        */
> > -     dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> > -     enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> > +     if (owner->sched_class->prevent_migration)
> > +             owner->sched_class->prevent_migration(rq, owner);
> > +}
> > +
> > +static inline void proxy_untag_prev(struct rq *rq, struct task_struct *prev)
> > +{
> > +     if (!sched_proxy_exec())
> > +             return;
> > +
> > +     if (prev->sched_class->allow_migration)
> > +             prev->sched_class->allow_migration(rq, prev);
> >  }
> >
> >  /*
> > @@ -6874,7 +6882,7 @@ static void __sched notrace __schedule(int sched_mode)
> >               if (!task_current_donor(rq, next))
> >                       proxy_tag_curr(rq, next);
> >               if (!(!preempt && prev_state) && prev != prev_donor)
> > -                     proxy_tag_curr(rq, prev);
> > +                     proxy_untag_prev(rq, prev);
> >
> >               /*
> >                * The membarrier system call requires each architecture
>
> Yeah, not a fan in this form.
>
> I really don't think we need new class callbacks for this. Esp. not
> named like this, which is quite terrible.

Yeah, apologies, I was cringing a bit at the prevent/allow_migration()
names, given they alias the other migration-related functions, but I
just wasn't feeling creative enough to come up with something else (we
do want to prevent rq->curr from being migrated when we're proxying).

> Note how migrate_disable() and migrate_enable() use ->set_cpus_allowed()
> and are both very much about preventing and allowing migration.
>
> Also note how set_next_task() / put_prev_task() already very much do
> what you want; except they only work for the donor.

Yeah, I like the set_proxy_task()/put_proxy_task() names *much*
better. Thank you for the suggestion!

> Further note that the only reason this proxy_tag_curr() thing lives
> where it does is because it depends on the value of current. However if
> you do this, you no longer have that constraint and then there is a much
> saner place for all this.
>
>
> So I think I prefer (ab)using the migrate_disable() infrastructure,
> simply because it would avoid having to do an (indirect) class call
> entirely -- but looking at how RT/DL handle this, I think there's bugs
> there.
>
> Specifically, something like pick_next_pushable_task() should never
> return something that has ->migration_disabled set, it should continue
> iterating the list until it finds one that hasn't.
>
>
> Anyway, without having tested anything at all, how crazy would something
> like this be?

It needed some rework to handle the pick_again looping properly, as
you pointed out on IRC, but the migration_disabled/migration_flags
state also needs to be handled on fork, or else we end up with tasks
stuck non-migratable with an MDF_PROXY flag that never gets dropped.

After working around those, I'm still sometimes hitting warnings and
issues around __set_cpus_allowed_ptr_locked(), which sounds like what
K Prateek mentioned. I'll have to dig into it more tomorrow.

thanks
-john


* Re: [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr
  2026-03-05  4:02     ` K Prateek Nayak
@ 2026-03-05 14:46       ` Peter Zijlstra
  2026-03-07  1:36         ` John Stultz
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2026-03-05 14:46 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: John Stultz, LKML, zhidao su, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Johannes Weiner, Steven Rostedt, Ben Segall, Mel Gorman,
	Joel Fernandes, Qais Yousef, Xuewen Yan, Suleiman Souhlal,
	kuyo chang, hupu, soolaugust, kernel-team

On Thu, Mar 05, 2026 at 09:32:05AM +0530, K Prateek Nayak wrote:
> Hello Peter,
> 
> On 3/4/2026 6:48 PM, Peter Zijlstra wrote:
> > +static inline void set_proxy_task(struct task_struct *p)
> >  {
> > -	if (!sched_proxy_exec())
> > -		return;
> > -	/*
> > -	 * pick_next_task() calls set_next_task() on the chosen task
> > -	 * at some point, which ensures it is not push/pullable.
> > -	 * However, the chosen/donor task *and* the mutex owner form an
> > -	 * atomic pair wrt push/pull.
> > -	 *
> > -	 * Make sure owner we run is not pushable. Unfortunately we can
> > -	 * only deal with that by means of a dequeue/enqueue cycle. :-/
> > -	 */
> > -	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> > -	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> > +	WARN_ON_ONCE(p->migration_flags & MDF_PROXY);
> > +	p->migration_flags |= MDF_PROXY;
> > +	p->migration_disabled++;
> > +}
> > +
> > +static inline void put_proxy_task(struct task_struct *p)
> > +{
> > +	WARN_ON_ONCE(!(p->migration_flags & MDF_PROXY));
> > +	p->migration_flags &= ~MDF_PROXY;
> > +	p->migration_disabled--;
> 
> Note: I'm not too familiar with the set_affinity bits so my
> understanding might be all wrong but ...
> 
> Doesn't the set_affinity bits have a completion based wait for tasks
> that are migration disabled? If we have a case like:
> 
>        P0                                             P1
>        ==                                             ==
> 
>   migrate_disable()
>     p->migration_disabled++; /* 1 */
>   ...
>   mutex_lock()
>   ...
> 
>   /* preempted */
>   /* proxied */
>   set_proxy_task(P0)
>     p->migration_disabled++; /* 2 */
>   ...
>   /* Continues running */
>                                         set_cpus_allowed_ptr(P0)
>                                           /* Task CPU not in the new mask. */
>                                           affine_move_task()
>                                             /*
>                                              * blocks as per the comment
>                                              * above affine_move_task().
>                                              */
>   migrate_enable()
>    if (p->migration_disabled > 1)
>      p->migration_disabled--; /* 1 */
>      return;
>   ...
>   mutex_unlock();
>   /* Goes into schedule. */
>   put_proxy_task(P0)
>     p->migration_disabled--; /* 0 */
> 
>   /* !!! Who does the migration + wakeup? !!! */
> 
> 
> Isn't it up to the last migrate_enable() (or in this case,
> put_proxy_task()) to schedule in the stopper and push the prev to
> another CPU? Or is it handled in some other way?

Indeed; so I think we can fix that by doing something like the below.
Have actual migrate_{dis,en}able {inc,dec}rement by 2 and have
this proxy thing {inc,dec} by 1.

Then have migrate_enable() ignore the low bit, such that 3->1 does the
slow-path and issues the completion.

So we only have the low bit set for the current proxy task; IOW any
non-current task will have it clear.

Then there are 3 sites that do the completion:

 - migrate_cpu_stop()
 - affine_move_task(); (1) when the mask fits the current location
 - affine_move_task(); (2) when the mask doesn't fit and the task is not
                           running

migrate_cpu_stop() is safe, because the stopper task can neither block
nor be the owner of a lock and must thus exist outside of PE;
furthermore, if it is running, no other task is running and thus the
target task cannot have the low bit set.

affine_move_task() case (1) is obviously fine.

affine_move_task() case (2) is also fine, because similar to
migrate_cpu_stop() the target task is found to not be running, and
therefore it cannot have the low bit set.

*lightbulb*... but, doesn't that mean that we don't need any of this at
all, and could simply make sure RT/DL refuse to migrate task_on_cpu(p) ?



---

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2412,8 +2412,8 @@ static inline void __migrate_enable(void
 		return;
 #endif
 
-	if (p->migration_disabled > 1) {
-		p->migration_disabled--;
+	if (p->migration_disabled > 3) {
+		p->migration_disabled -= 2;
 		return;
 	}
 
@@ -2430,7 +2430,7 @@ static inline void __migrate_enable(void
 	 * select_fallback_rq) get confused.
 	 */
 	barrier();
-	p->migration_disabled = 0;
+	p->migration_disabled -= 2;
 	this_rq_pinned()--;
 }
 
@@ -2445,13 +2445,13 @@ static inline void __migrate_disable(voi
 		 */
 		WARN_ON_ONCE((s16)p->migration_disabled < 0);
 #endif
-		p->migration_disabled++;
+		p->migration_disabled += 2;
 		return;
 	}
 
 	guard(preempt)();
 	this_rq_pinned()++;
-	p->migration_disabled = 1;
+	p->migration_disabled += 2;
 }
 #else /* !COMPILE_OFFSETS */
 static inline void __migrate_disable(void) { }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2410,11 +2410,7 @@ static void migrate_disable_switch(struc
 		.flags     = SCA_MIGRATE_DISABLE,
 	};
 
-	if (likely(!p->migration_disabled))
-		return;
-
-	if ((p->migration_flags & MDF_PROXY) &&
-	    p->migration_disabled == 1)
+	if (likely(!(p->migration_disabled & ~1)))
 		return;
 
 	if (p->cpus_ptr != &p->cpus_mask)
@@ -6758,15 +6754,13 @@ find_proxy_task(struct rq *rq, struct ta
 
 static inline void set_proxy_task(struct task_struct *p)
 {
-	WARN_ON_ONCE(p->migration_flags & MDF_PROXY);
-	p->migration_flags |= MDF_PROXY;
+	WARN_ON_ONCE(p->migration_disabled & 1);
 	p->migration_disabled++;
 }
 
 static inline void put_proxy_task(struct task_struct *p)
 {
-	WARN_ON_ONCE(!(p->migration_flags & MDF_PROXY));
-	p->migration_flags &= ~MDF_PROXY;
+	WARN_ON_ONCE(!(p->migration_disabled & 1));
 	p->migration_disabled--;
 }
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1368,7 +1368,6 @@ static inline int cpu_of(struct rq *rq)
 }
 
 #define MDF_PUSH		0x01
-#define MDF_PROXY		0x02
 
 static inline bool is_migration_disabled(struct task_struct *p)
 {




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr
  2026-03-05 14:46       ` Peter Zijlstra
@ 2026-03-07  1:36         ` John Stultz
  2026-03-07  7:39           ` [RFC][PATCH] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
  0 siblings, 1 reply; 9+ messages in thread
From: John Stultz @ 2026-03-07  1:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, LKML, zhidao su, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Johannes Weiner, Steven Rostedt, Ben Segall, Mel Gorman,
	Joel Fernandes, Qais Yousef, Xuewen Yan, Suleiman Souhlal,
	kuyo chang, hupu, soolaugust, kernel-team

On Thu, Mar 5, 2026 at 6:46 AM Peter Zijlstra <peterz@infradead.org> wrote:
> *lightbulb*... but, doesn't that mean that we don't need any of this at
> all, and could simply make sure RT/DL refuse to migrate task_on_cpu(p) ?
>

This does seem painfully obvious now. :)
Instead of trying to avoid putting current on the pushable list, we
can just check task_current() (since we already check
task_on_cpu()) in task_is_pushable().

I'll switch to that and hopefully we can get rid of the whole
proxy_tag_curr gunk.

thanks
-john


* [RFC][PATCH] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-07  1:36         ` John Stultz
@ 2026-03-07  7:39           ` John Stultz
  2026-03-07  9:40             ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: John Stultz @ 2026-03-07  7:39 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Peter Zijlstra, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Suleiman Souhlal,
	Johannes Weiner, Joel Fernandes, Qais Yousef, zhidao su,
	Xuewen Yan, kuyo chang, hupu, kernel-team

With proxy-execution, the scheduler selects the donor, but for
blocked donors, we end up running the lock owner.

This caused some complexity, because the class schedulers make
sure to remove the task they pick from their pushable task
lists, which prevents the donor from being migrated, but there
wasn't then anything to prevent rq->curr from being migrated
if rq->curr != rq->donor.

This was sort of hacked around by calling proxy_tag_curr() on
the rq->curr task if we were running something other than the
donor. proxy_tag_curr() did a dequeue/enqueue pair on the
rq->curr task, allowing the class schedulers to remove it from
their pushable list.

The dequeue/enqueue pair was wasteful, and additionally K Prateek
highlighted that we didn't properly undo things when we stopped
proxying, leaving the lock owner off the pushable list.

After considering some alternative approaches, Peter suggested
having the RT/DL classes simply avoid migrating tasks that are
task_on_cpu().

So rework pick_next_pushable_dl_task() and the rt
pick_next_pushable_task() functions so that they skip over the
first pushable task if it is on_cpu.

Then just drop all of the proxy_tag_curr() logic.

Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
CC: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: zhidao su <suzhidao@xiaomi.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c     | 24 ------------------------
 kernel/sched/deadline.c | 16 ++++++++++++++--
 kernel/sched/rt.c       | 15 ++++++++++++---
 3 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6960c1bfc741a..88db2b2bf3d46 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6725,23 +6725,6 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 }
 #endif /* SCHED_PROXY_EXEC */
 
-static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
-{
-	if (!sched_proxy_exec())
-		return;
-	/*
-	 * pick_next_task() calls set_next_task() on the chosen task
-	 * at some point, which ensures it is not push/pullable.
-	 * However, the chosen/donor task *and* the mutex owner form an
-	 * atomic pair wrt push/pull.
-	 *
-	 * Make sure owner we run is not pushable. Unfortunately we can
-	 * only deal with that by means of a dequeue/enqueue cycle. :-/
-	 */
-	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
-	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
-}
-
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6891,9 +6874,6 @@ static void __sched notrace __schedule(int sched_mode)
 		 */
 		RCU_INIT_POINTER(rq->curr, next);
 
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -6928,10 +6908,6 @@ static void __sched notrace __schedule(int sched_mode)
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
-		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq);
 		raw_spin_rq_unlock_irq(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index c4402542ef44f..2cf2c1ac83493 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2556,12 +2556,24 @@ static int find_later_rq(struct task_struct *task)
 
 static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 {
-	struct task_struct *p;
+	struct task_struct *p = NULL;
+	struct rb_node *next_node;
 
 	if (!has_pushable_dl_tasks(rq))
 		return NULL;
 
-	p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
+	next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
+	while (next_node) {
+		p = __node_2_pdl(next_node);
+		/* make sure task isn't on_cpu (possible with proxy-exec) */
+		if (!task_on_cpu(rq, p))
+			break;
+
+		next_node = rb_next(next_node);
+	}
+
+	if (!next_node)
+		return NULL;
 
 	WARN_ON_ONCE(rq->cpu != task_cpu(p));
 	WARN_ON_ONCE(task_current(rq, p));
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index fb07dcfc60a24..5dcbe776aadd2 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1847,13 +1847,22 @@ static int find_lowest_rq(struct task_struct *task)
 
 static struct task_struct *pick_next_pushable_task(struct rq *rq)
 {
-	struct task_struct *p;
+	struct plist_head *head = &rq->rt.pushable_tasks;
+	struct task_struct *i, *p = NULL;
 
 	if (!has_pushable_tasks(rq))
 		return NULL;
 
-	p = plist_first_entry(&rq->rt.pushable_tasks,
-			      struct task_struct, pushable_tasks);
+	plist_for_each_entry(i, head, pushable_tasks) {
+		/* make sure task isn't on_cpu (possible with proxy-exec) */
+		if (!task_on_cpu(rq, i)) {
+			p = i;
+			break;
+		}
+	}
+
+	if (!p)
+		return NULL;
 
 	BUG_ON(rq->cpu != task_cpu(p));
 	BUG_ON(task_current(rq, p));
-- 
2.53.0.473.g4a7958ca14-goog



* Re: [RFC][PATCH] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-07  7:39           ` [RFC][PATCH] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
@ 2026-03-07  9:40             ` Peter Zijlstra
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2026-03-07  9:40 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, K Prateek Nayak, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Suleiman Souhlal, Johannes Weiner,
	Joel Fernandes, Qais Yousef, zhidao su, Xuewen Yan, kuyo chang,
	hupu, kernel-team

On Sat, Mar 07, 2026 at 07:39:29AM +0000, John Stultz wrote:
> With proxy-execution, the scheduler selects the donor, but for
> blocked donors, we end up running the lock owner.
> 
> This caused some complexity, because the class schedulers make
> sure to remove the task they pick from their pushable task
> lists, which prevents the donor from being migrated, but there
> wasn't then anything to prevent rq->curr from being migrated
> if rq->curr != rq->donor.
> 
> This was sort of hacked around by calling proxy_tag_curr() on
> the rq->curr task if we were running something other than the
> donor. proxy_tag_curr() did a dequeue/enqueue pair on the
> rq->curr task, allowing the class schedulers to remove it from
> their pushable list.
> 
> The dequeue/enqueue pair was wasteful, and additionally K Prateek
> highlighted that we didn't properly undo things when we stopped
> proxying, leaving the lock owner off the pushable list.
> 
> After considering some alternative approaches, Peter suggested
> having the RT/DL classes simply avoid migrating tasks that are
> task_on_cpu().
> 
> So rework pick_next_pushable_dl_task() and the rt
> pick_next_pushable_task() functions so that they skip over the
> first pushable task if it is on_cpu.
> 
> Then just drop all of the proxy_tag_curr() logic.
> 
> Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability")
> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: John Stultz <jstultz@google.com>

Right, that works for me ;-)

Some bits I also had in my 'patch' that didn't make it, and quite
frankly don't belong in the same patch anyway, are below.

Compilers are really bad at optimizing the static branch things (as
in they utterly refuse, even when marked with __pure), and will
happily emit multiple identical branches in a row.

So pull out the one obvious sched_proxy_exec() branch in __schedule()
and remove some of the 'implicit' ones in that path.

Hmm?

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6597,11 +6597,7 @@ find_proxy_task(struct rq *rq, struct ta
 	struct mutex *mutex;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; task_is_blocked(p); p = owner) {
-		mutex = p->blocked_on;
-		/* Something changed in the chain, so pick again */
-		if (!mutex)
-			return NULL;
+	for (p = donor; (mutex = p->blocked_on); p = owner) {
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
 		 * and ensure @owner sticks around.
@@ -6829,14 +6825,16 @@ static void __sched notrace __schedule(i
 
 pick_again:
 	next = pick_next_task(rq, rq->donor, &rf);
-	rq_set_donor(rq, next);
 	rq->next_class = next->sched_class;
-	if (unlikely(task_is_blocked(next))) {
-		next = find_proxy_task(rq, next, &rf);
-		if (!next)
-			goto pick_again;
-		if (next == rq->idle)
-			goto keep_resched;
+	if (sched_proxy_exec()) {
+		rq_set_donor(rq, next);
+		if (next->blocked_on) {
+			next = find_proxy_task(rq, next, &rf);
+			if (!next)
+				goto pick_again;
+			if (next == rq->idle)
+				goto keep_resched;
+		}
 	}
 picked:
 	clear_tsk_need_resched(prev);
@@ -6886,10 +6884,6 @@ static void __sched notrace __schedule(i
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
-		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq, NULL);
 		raw_spin_rq_unlock_irq(rq);


end of thread, other threads:[~2026-03-07  9:40 UTC | newest]

Thread overview: 9+ messages
2026-03-04  6:38 [RFC][PATCH 1/2] sched: proxy-exec: Fix tasks being left unpushable from proxy_tag_curr() John Stultz
2026-03-04  6:38 ` [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr John Stultz
2026-03-04 13:18   ` Peter Zijlstra
2026-03-05  4:02     ` K Prateek Nayak
2026-03-05 14:46       ` Peter Zijlstra
2026-03-07  1:36         ` John Stultz
2026-03-07  7:39           ` [RFC][PATCH] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
2026-03-07  9:40             ` Peter Zijlstra
2026-03-05  7:31     ` [RFC][PATCH 2/2] sched: proxy-exec: Add allow/prevent_migration hooks in the sched classes for proxy_tag_curr John Stultz
