public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v25 0/9] Simple Donor Migration for Proxy Execution
@ 2026-03-13  2:30 John Stultz
  2026-03-13  2:30 ` [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Hey All,

Yet another iteration on the next chunk of the Proxy Exec
series: Simple Donor Migration

This is just the next step for Proxy Execution, to allow us to
migrate blocked donors across runqueues to boost remote lock
owners.

As always, I’m trying to submit this larger work in smallish,
digestible pieces. So in this portion of the series I’m only
submitting for review and consideration some recent fixups,
along with the logic that allows us to do donor (blocked waiter)
migration. That migration requires some additional changes to
locking and extra state tracking to ensure we don’t accidentally
run a migrated donor on a cpu it isn’t affined to, as well as
some extra handling to deal with balance callback state that
needs to be reset when we decide to pick a different task after
doing donor migration.
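
To make that concrete, the core invariant here, roughly
sketched (this is just an illustration, not code lifted from the
patches; proxy_force_return() is a helper from the series but
its exact signature here is assumed):

	/*
	 * A proxy-migrated donor may be sitting on a runqueue it is
	 * not affined to, so before running it we must check its
	 * cpumask and, if needed, return-migrate it first.
	 */
	if (!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr)) {
		proxy_force_return(rq, p);	/* migrate p back home */
		return NULL;			/* ...and pick again */
	}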

Much of the new logic in this version is thanks to K Prateek,
who provided a lot of insightful suggestions to the v24 series!

New in this iteration:
* With additional changes, the previous full Donor Migration
  series had gotten pretty long, so to go easy on reviewers I’ve
  dropped the later Donor Migration patches I had in v24. Those
  basically provided optimizations: having try_to_wake_up() do
  return-migration, smarter mutex handoffs, and proxy-migrating
  the entire chain in one pass. K Prateek also had some
  suggestions for further improvements in those later patches
  that I have not yet addressed, so for now I’m going to table
  them and will revisit once progress is made with this set.

* Fix for proxy_tag_curr() erroneously leaving tasks off of the
  pushable list, reported by K Prateek and suggested by Peter,
  allowing us to drop the proxy_tag_curr() logic completely.

* Peter noted compilers don’t always optimize as we would like,
  and suggested reworked logic to reduce repetitive
  sched_proxy_exec() branches.

* Rework of proxy_force_return() suggested by K Prateek to use
  WF_TTWU flags, and to use attach_one_task() helper to simplify
  code.

* Other small cleanups through the series suggested by
  K Prateek.

I’d love to get further feedback on any place where these
patches are confusing, or could use additional clarifications.

There have also been some further improvements in the full Proxy
Execution series:
* David Stevens reported and diagnosed an issue with loadavg
  being wrong due to incorrect nr_uninterruptible accounting in
  the sleeping-owner handling.

* An issue with rwsem support was found and fixed, along with
  other simplifications to the changes.

* Fix suggested by Peter for an edge case with DL adding tasks
  twice to the pushable list when Proxy Exec pushes the donor
  task.

* K Prateek had further suggestions to improve the optimized
  donor migration changes, dropping the unnecessary
  migration_node addition to the task_struct, and using
  attach_tasks to simplify the full chain migration.

* Tiffany Yang pointed out some unnecessary CONFIG_SMP bits
  were still lingering and could be cleaned up.

* An initial draft of a Documentation update describing Proxy
  Execution.

I’d appreciate any testing or comments that folks have with
the full set!

You can find the full Proxy Exec series here:
  https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v25-7.0-rc3/
  https://github.com/johnstultz-work/linux-dev.git proxy-exec-v25-7.0-rc3


Issues still to address with the full series:
* Resolve a regression in the later optimized donor-migration
  changes when combined with the “Fix 'stuck' dl_server” change
  in 6.19

* With the full series against 7.0-rc3, when doing heavy stress
  testing, I’m occasionally hitting crashes due to a null return
  from __pick_eevdf(). Need to dig into this and find out why it
  doesn’t happen against 6.18

* Try to integrate and rework K Prateek’s suggestions for the
  later optimized donor-migration changes.

* Continue working to get sched_ext to be ok with Proxy
  Execution enabled.

* Reevaluate performance regression K Prateek Nayak found with
  the full series.

* The chain migration functionality needs further iterations and
  better validation to ensure it truly maintains the RT/DL load
  balancing invariants (despite this being broken in vanilla
  upstream with RT_PUSH_IPI currently)

Future work:
* Expand to more locking primitives: Figuring out pi-futexes
  would be good, using proxy for Binder PI is something else
  we’re exploring.

* Eventually: Work to replace rt_mutexes and get things happy
  with PREEMPT_RT

I’d really appreciate any feedback or review thoughts on the
full series as well. I’m trying to keep the chunks small,
reviewable and iteratively testable, but if you have any
suggestions on how to improve the larger series, I’m all ears.

Credit/Disclaimer:
----------------------
As always, this Proxy Execution series has a long history with
lots of developers that deserve credit:

First described in a paper[1] by Watkins, Straub, and Niehaus,
then developed in patches from Peter Zijlstra, and extended with
lots of work by Juri Lelli, Valentin Schneider, and Connor
O'Brien (and thank you to Steven Rostedt for providing
additional details here!).
Thanks also to Joel Fernandes, Dietmar Eggemann, Metin Kaya,
K Prateek Nayak and Suleiman Souhlal for their substantial
review, suggestion, and patch contributions.

So again, many thanks to those above, as all the credit for this
series really is due to them - while the mistakes are surely mine.

Thanks so much!
-john

[1] https://static.lwn.net/images/conf/rtlws11/papers/proc/p38.pdf

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com

John Stultz (9):
  sched: Make class_schedulers avoid pushing current, and get rid of
    proxy_tag_curr()
  sched: Minimise repeated sched_proxy_exec() checking
  locking: Add task::blocked_lock to serialize blocked_on state
  sched: Fix modifying donor->blocked_on without proper locking
  sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy
    return-migration
  sched: Add assert_balance_callbacks_empty helper
  sched: Add logic to zap balance callbacks if we pick again
  sched: Move attach_one_task and attach_task helpers to sched.h
  sched: Handle blocked-waiter migration (and return migration)

 include/linux/sched.h        |  91 +++++++----
 init/init_task.c             |   1 +
 kernel/fork.c                |   1 +
 kernel/locking/mutex-debug.c |   4 +-
 kernel/locking/mutex.c       |  40 +++--
 kernel/locking/mutex.h       |   6 +
 kernel/locking/ww_mutex.h    |  16 +-
 kernel/sched/core.c          | 300 +++++++++++++++++++++++++++++------
 kernel/sched/deadline.c      |  16 +-
 kernel/sched/fair.c          |  26 ---
 kernel/sched/rt.c            |  15 +-
 kernel/sched/sched.h         |  35 +++-
 12 files changed, 414 insertions(+), 137 deletions(-)

-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
@ 2026-03-13  2:30 ` John Stultz
  2026-03-13 13:48   ` Juri Lelli
  2026-03-15 16:26   ` K Prateek Nayak
  2026-03-13  2:30 ` [PATCH v25 2/9] sched: Minimise repeated sched_proxy_exec() checking John Stultz
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Peter Zijlstra, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

With proxy-execution, the scheduler selects the donor, but for
blocked donors, we end up running the lock owner.

This caused some complexity: the class schedulers make sure to
remove the task they pick from their pushable task lists, which
prevents the donor from being migrated, but there was then
nothing to prevent rq->curr from being migrated if
rq->curr != rq->donor.

This was sort of hacked around by calling proxy_tag_curr() on
the rq->curr task if we were running something other than the
donor. proxy_tag_curr() did a dequeue/enqueue pair on the
rq->curr task, allowing the class schedulers to remove it from
their pushable list.

The dequeue/enqueue pair was wasteful, and additionally K Prateek
highlighted that we didn't properly undo things when we stopped
proxying, leaving the lock owner off the pushable list.

After some alternative approaches were considered, Peter
suggested just having the RT/DL classes avoid migrating tasks
that are task_on_cpu().

So rework pick_next_pushable_dl_task() and the rt
pick_next_pushable_task() functions so that they skip over
pushable tasks that are on_cpu.

Then just drop all of the proxy_tag_curr() logic.

Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c     | 24 ------------------------
 kernel/sched/deadline.c | 16 ++++++++++++++--
 kernel/sched/rt.c       | 15 ++++++++++++---
 3 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7f77c165a6e0..d86d648a75a4b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6702,23 +6702,6 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 }
 #endif /* SCHED_PROXY_EXEC */
 
-static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
-{
-	if (!sched_proxy_exec())
-		return;
-	/*
-	 * pick_next_task() calls set_next_task() on the chosen task
-	 * at some point, which ensures it is not push/pullable.
-	 * However, the chosen/donor task *and* the mutex owner form an
-	 * atomic pair wrt push/pull.
-	 *
-	 * Make sure owner we run is not pushable. Unfortunately we can
-	 * only deal with that by means of a dequeue/enqueue cycle. :-/
-	 */
-	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
-	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
-}
-
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6871,9 +6854,6 @@ static void __sched notrace __schedule(int sched_mode)
 		 */
 		RCU_INIT_POINTER(rq->curr, next);
 
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		/*
 		 * The membarrier system call requires each architecture
 		 * to have a full memory barrier after updating
@@ -6907,10 +6887,6 @@ static void __sched notrace __schedule(int sched_mode)
 		/* Also unlocks the rq: */
 		rq = context_switch(rq, prev, next, &rf);
 	} else {
-		/* In case next was already curr but just got blocked_donor */
-		if (!task_current_donor(rq, next))
-			proxy_tag_curr(rq, next);
-
 		rq_unpin_lock(rq, &rf);
 		__balance_callbacks(rq, NULL);
 		raw_spin_rq_unlock_irq(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b004293234..4e746f4de6529 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2801,12 +2801,24 @@ static int find_later_rq(struct task_struct *task)
 
 static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
 {
-	struct task_struct *p;
+	struct task_struct *p = NULL;
+	struct rb_node *next_node;
 
 	if (!has_pushable_dl_tasks(rq))
 		return NULL;
 
-	p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
+	next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
+	while (next_node) {
+		p = __node_2_pdl(next_node);
+		/* make sure task isn't on_cpu (possible with proxy-exec) */
+		if (!task_on_cpu(rq, p))
+			break;
+
+		next_node = rb_next(next_node);
+	}
+
+	if (!p)
+		return NULL;
 
 	WARN_ON_ONCE(rq->cpu != task_cpu(p));
 	WARN_ON_ONCE(task_current(rq, p));
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f69e1f16d9238..61569b622d1a3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1853,13 +1853,22 @@ static int find_lowest_rq(struct task_struct *task)
 
 static struct task_struct *pick_next_pushable_task(struct rq *rq)
 {
-	struct task_struct *p;
+	struct plist_head *head = &rq->rt.pushable_tasks;
+	struct task_struct *i, *p = NULL;
 
 	if (!has_pushable_tasks(rq))
 		return NULL;
 
-	p = plist_first_entry(&rq->rt.pushable_tasks,
-			      struct task_struct, pushable_tasks);
+	plist_for_each_entry(i, head, pushable_tasks) {
+		/* make sure task isn't on_cpu (possible with proxy-exec) */
+		if (!task_on_cpu(rq, i)) {
+			p = i;
+			break;
+		}
+	}
+
+	if (!p)
+		return NULL;
 
 	BUG_ON(rq->cpu != task_cpu(p));
 	BUG_ON(task_current(rq, p));
-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v25 2/9] sched: Minimise repeated sched_proxy_exec() checking
  2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
  2026-03-13  2:30 ` [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
@ 2026-03-13  2:30 ` John Stultz
  2026-03-15 17:01   ` K Prateek Nayak
  2026-03-13  2:30 ` [PATCH v25 3/9] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Peter Zijlstra, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Peter noted: Compilers are really bad (as in they utterly
refuse) at optimizing the static branch things (even when marked
with __pure), and will happily emit multiple identical checks in
a row.

So pull out the one obvious sched_proxy_exec() branch in
__schedule() and remove some of the 'implicit' ones in that
path.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d86d648a75a4b..84c61496fa263 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6597,11 +6597,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	struct mutex *mutex;
 
 	/* Follow blocked_on chain. */
-	for (p = donor; task_is_blocked(p); p = owner) {
-		mutex = p->blocked_on;
-		/* Something changed in the chain, so pick again */
-		if (!mutex)
-			return NULL;
+	for (p = donor; (mutex = p->blocked_on); p = owner) {
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
 		 * and ensure @owner sticks around.
@@ -6832,12 +6828,14 @@ static void __sched notrace __schedule(int sched_mode)
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
 	rq->next_class = next->sched_class;
-	if (unlikely(task_is_blocked(next))) {
-		next = find_proxy_task(rq, next, &rf);
-		if (!next)
-			goto pick_again;
-		if (next == rq->idle)
-			goto keep_resched;
+	if (sched_proxy_exec()) {
+		if (unlikely(next->blocked_on)) {
+			next = find_proxy_task(rq, next, &rf);
+			if (!next)
+				goto pick_again;
+			if (next == rq->idle)
+				goto keep_resched;
+		}
 	}
 picked:
 	clear_tsk_need_resched(prev);
-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v25 3/9] locking: Add task::blocked_lock to serialize blocked_on state
  2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
  2026-03-13  2:30 ` [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
  2026-03-13  2:30 ` [PATCH v25 2/9] sched: Minimise repeated sched_proxy_exec() checking John Stultz
@ 2026-03-13  2:30 ` John Stultz
  2026-03-13  2:30 ` [PATCH v25 4/9] sched: Fix modifying donor->blocked_on without proper locking John Stultz
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.

So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.
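
The expected pattern then looks roughly like this (a sketch of
the nesting only, matching the lock ordering comment this patch
adds to find_proxy_task()):

	guard(raw_spinlock)(&mutex->wait_lock);	/* outer lock */
	guard(raw_spinlock)(&p->blocked_lock);	/* nests underneath */
	/* ... p->blocked_on can now be safely inspected/updated ... */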

Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v15:
* Split back out into later in the series
v16:
* Fixups to mark tasks unblocked before sleeping in
  mutex_optimistic_spin()
* Rework to use guard() as suggested by Peter
v19:
* Rework logic for PREEMPT_RT issues reported by
  K Prateek Nayak
v21:
* After recently thinking more on ww_mutex code, I
  reworked the blocked_lock usage in mutex lock to
  avoid having to take nested locks in the ww_mutex
  paths, as I was concerned the lock ordering
  constraints weren't as strong as I had previously
  thought.
v22:
* Added some extra spaces to avoid dense code blocks
  suggested by K Prateek
v23:
* Move get_task_blocked_on() to kernel/locking/mutex.h
  as requested by PeterZ

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h        | 48 +++++++++++++-----------------------
 init/init_task.c             |  1 +
 kernel/fork.c                |  1 +
 kernel/locking/mutex-debug.c |  4 +--
 kernel/locking/mutex.c       | 40 +++++++++++++++++++-----------
 kernel/locking/mutex.h       |  6 +++++
 kernel/locking/ww_mutex.h    |  4 +--
 kernel/sched/core.c          |  4 ++-
 8 files changed, 58 insertions(+), 50 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a7b4a980eb2f0..f9924ec02c4f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1238,6 +1238,7 @@ struct task_struct {
 #endif
 
 	struct mutex			*blocked_on;	/* lock we're blocked on */
+	raw_spinlock_t			blocked_lock;
 
 #ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
 	/*
@@ -2181,57 +2182,42 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 #ifndef CONFIG_PREEMPT_RT
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
-	struct mutex *m = p->blocked_on;
-
-	if (m)
-		lockdep_assert_held_once(&m->wait_lock);
-	return m;
+	lockdep_assert_held_once(&p->blocked_lock);
+	return p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
 	WARN_ON_ONCE(!m);
 	/* The task should only be setting itself as blocked */
 	WARN_ON_ONCE(p != current);
-	/* Currently we serialize blocked_on under the mutex::wait_lock */
-	lockdep_assert_held_once(&m->wait_lock);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
 	/*
 	 * Check ensure we don't overwrite existing mutex value
 	 * with a different mutex. Note, setting it to the same
 	 * lock repeatedly is ok.
 	 */
-	WARN_ON_ONCE(blocked_on && blocked_on != m);
-	WRITE_ONCE(p->blocked_on, m);
-}
-
-static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
-{
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
-	__set_task_blocked_on(p, m);
+	WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+	p->blocked_on = m;
 }
 
 static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	if (m) {
-		struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
-		/* Currently we serialize blocked_on under the mutex::wait_lock */
-		lockdep_assert_held_once(&m->wait_lock);
-		/*
-		 * There may be cases where we re-clear already cleared
-		 * blocked_on relationships, but make sure we are not
-		 * clearing the relationship with a different lock.
-		 */
-		WARN_ON_ONCE(blocked_on && blocked_on != m);
-	}
-	WRITE_ONCE(p->blocked_on, NULL);
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
+	/*
+	 * There may be cases where we re-clear already cleared
+	 * blocked_on relationships, but make sure we are not
+	 * clearing the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	p->blocked_on = NULL;
 }
 
 static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 {
-	guard(raw_spinlock_irqsave)(&m->wait_lock);
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
 #else
diff --git a/init/init_task.c b/init/init_task.c
index 5c838757fc10e..b5f48ebdc2b6e 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -169,6 +169,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+	.blocked_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
 	.timer_slack_ns = 50000, /* 50 usec default slack */
 	.thread_pid	= &init_struct_pid,
 	.thread_node	= LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index 65113a304518a..f233316ffad42 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2076,6 +2076,7 @@ __latent_entropy struct task_struct *copy_process(
 	ftrace_graph_init_task(p);
 
 	rt_mutex_init_task(p);
+	raw_spin_lock_init(&p->blocked_lock);
 
 	lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 2c6b02d4699be..cc6aa9c6e9813 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 	lockdep_assert_held(&lock->wait_lock);
 
 	/* Current thread can't be already blocked (since it's executing!) */
-	DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
+	DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 			 struct task_struct *task)
 {
-	struct mutex *blocked_on = __get_task_blocked_on(task);
+	struct mutex *blocked_on = get_task_blocked_on(task);
 
 	DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
 	DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2a1d165b3167e..4aa79bcab08c7 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -656,6 +656,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			goto err_early_kill;
 	}
 
+	raw_spin_lock(&current->blocked_lock);
 	__set_task_blocked_on(current, lock);
 	set_current_state(state);
 	trace_contention_begin(lock, LCB_F_MUTEX);
@@ -669,8 +670,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 		 * the handoff.
 		 */
 		if (__mutex_trylock(lock))
-			goto acquired;
+			break;
 
+		raw_spin_unlock(&current->blocked_lock);
 		/*
 		 * Check for signals and kill conditions while holding
 		 * wait_lock. This ensures the lock cancellation is ordered
@@ -693,12 +695,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 
 		first = __mutex_waiter_is_first(lock, &waiter);
 
+		raw_spin_lock_irqsave(&lock->wait_lock, flags);
+		raw_spin_lock(&current->blocked_lock);
 		/*
 		 * As we likely have been woken up by task
 		 * that has cleared our blocked_on state, re-set
 		 * it to the lock we are trying to acquire.
 		 */
-		set_task_blocked_on(current, lock);
+		__set_task_blocked_on(current, lock);
 		set_current_state(state);
 		/*
 		 * Here we order against unlock; we must either see it change
@@ -709,25 +713,33 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 			break;
 
 		if (first) {
-			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			bool opt_acquired;
+
 			/*
 			 * mutex_optimistic_spin() can call schedule(), so
-			 * clear blocked on so we don't become unselectable
+			 * we need to release these locks before calling it,
+			 * and clear blocked on so we don't become unselectable
 			 * to run.
 			 */
-			clear_task_blocked_on(current, lock);
-			if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+			__clear_task_blocked_on(current, lock);
+			raw_spin_unlock(&current->blocked_lock);
+			raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+
+			trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+			opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+
+			raw_spin_lock_irqsave(&lock->wait_lock, flags);
+			raw_spin_lock(&current->blocked_lock);
+			__set_task_blocked_on(current, lock);
+
+			if (opt_acquired)
 				break;
-			set_task_blocked_on(current, lock);
 			trace_contention_begin(lock, LCB_F_MUTEX);
 		}
-
-		raw_spin_lock_irqsave(&lock->wait_lock, flags);
 	}
-	raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
 	__clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
+	raw_spin_unlock(&current->blocked_lock);
 
 	if (ww_ctx) {
 		/*
@@ -756,11 +768,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
 	return 0;
 
 err:
-	__clear_task_blocked_on(current, lock);
+	clear_task_blocked_on(current, lock);
 	__set_current_state(TASK_RUNNING);
 	__mutex_remove_waiter(lock, &waiter);
 err_early_kill:
-	WARN_ON(__get_task_blocked_on(current));
+	WARN_ON(get_task_blocked_on(current));
 	trace_contention_end(lock, ret);
 	raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
 	debug_mutex_free_waiter(&waiter);
@@ -971,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		__clear_task_blocked_on(next, lock);
+		clear_task_blocked_on(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 9ad4da8cea004..7a8ba13fee949 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -47,6 +47,12 @@ static inline struct task_struct *__mutex_owner(struct mutex *lock)
 	return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
 }
 
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	return __get_task_blocked_on(p);
+}
+
 #ifdef CONFIG_DEBUG_MUTEXES
 extern void debug_mutex_lock_common(struct mutex *lock,
 				    struct mutex_waiter *waiter);
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 31a785afee6c0..e4a81790ea7dd 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		 * blocked_on pointer. Otherwise we can see circular
 		 * blocked_on relationships that can't resolve.
 		 */
-		__clear_task_blocked_on(waiter->task, lock);
+		clear_task_blocked_on(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 			 * are waking the mutex owner, who may be currently
 			 * blocked on a different mutex.
 			 */
-			__clear_task_blocked_on(owner, NULL);
+			clear_task_blocked_on(owner, NULL);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84c61496fa263..96e2784dbba49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6584,6 +6584,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
  *   p->pi_lock
  *     rq->lock
  *       mutex->wait_lock
+ *         p->blocked_lock
  *
  * Returns the task that is going to be used as execution context (the one
  * that is actually going to be run on cpu_of(rq)).
@@ -6603,8 +6604,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 * and ensure @owner sticks around.
 		 */
 		guard(raw_spinlock)(&mutex->wait_lock);
+		guard(raw_spinlock)(&p->blocked_lock);
 
-		/* Check again that p is blocked with wait_lock held */
+		/* Check again that p is blocked with blocked_lock held */
 		if (mutex != __get_task_blocked_on(p)) {
 			/*
 			 * Something changed in the blocked_on chain and
-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v25 4/9] sched: Fix modifying donor->blocked_on without proper locking
  2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
                   ` (2 preceding siblings ...)
  2026-03-13  2:30 ` [PATCH v25 3/9] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
@ 2026-03-13  2:30 ` John Stultz
  2026-03-13  2:30 ` [PATCH v25 5/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Introduce an action enum in find_proxy_task() which allows us
to handle work that needs to be done outside the mutex.wait_lock
and task.blocked_lock guard scopes.

This ensures proper locking when we clear the donor's blocked_on
pointer in proxy_deactivate(), and the switch statement will be
useful as we add more cases to handle later in this series.

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v23:
* Split out from earlier patch.
v24:
* Minor re-ordering local variables to keep with style
  as suggested by K Prateek

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 96e2784dbba49..0bb7272106c9e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6568,7 +6568,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
 		 * as unblocked, as we aren't doing proxy-migrations
 		 * yet (more logic will be needed then).
 		 */
-		donor->blocked_on = NULL;
+		clear_task_blocked_on(donor, NULL);
 	}
 	return NULL;
 }
@@ -6592,6 +6592,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
+	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
 	struct task_struct *owner = NULL;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
@@ -6625,12 +6626,14 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
-			return proxy_deactivate(rq, donor);
+			action = DEACTIVATE_DONOR;
+			break;
 		}
 
 		if (task_cpu(owner) != this_cpu) {
 			/* XXX Don't handle migrations yet */
-			return proxy_deactivate(rq, donor);
+			action = DEACTIVATE_DONOR;
+			break;
 		}
 
 		if (task_on_rq_migrating(owner)) {
@@ -6688,6 +6691,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 		 */
 	}
 
+	/* Handle actions we need to do outside of the guard() scope */
+	switch (action) {
+	case DEACTIVATE_DONOR:
+		return proxy_deactivate(rq, donor);
+	case FOUND:
+		/* fallthrough */;
+	}
 	WARN_ON_ONCE(owner && !owner->on_rq);
 	return owner;
 }
-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v25 5/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
  2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
                   ` (3 preceding siblings ...)
  2026-03-13  2:30 ` [PATCH v25 4/9] sched: Fix modifying donor->blocked_on without proper locking John Stultz
@ 2026-03-13  2:30 ` John Stultz
  2026-03-13  2:30 ` [PATCH v25 6/9] sched: Add assert_balance_callbacks_empty helper John Stultz
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.

Peter helpfully provided the following example with pictures:
"Suppose we have a ww_mutex cycle:

                  ,-+-* Mutex-1 <-.
        Task-A ---' |             | ,-- Task-B
                    `-> Mutex-2 *-+-'

Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.

Then the blocked_on->owner chain will go in circles.

        Task-A  -> Mutex-2
          ^          |
          |          v
        Mutex-1 <- Task-B

We need two things:

 - find_proxy_task() to stop iterating the circle;

 - the woken task to 'unblock' and run, such that it can
   back-off and re-try the transaction.

Now, the current code [without this patch] does:
        __clear_task_blocked_on();
        wake_q_add();

And surely clearing ->blocked_on is sufficient to break the
cycle.

Suppose it is Task-B that is made to back-off, then we have:

  Task-A -> Mutex-2 -> Task-B (no further blocked_on)

and it would attempt to run Task-B. Or worse, it could directly
pick Task-B and run it, without ever getting into
find_proxy_task().

Now, here is a problem because Task-B might not be runnable on
the CPU it is currently on; and because !task_is_blocked() we
don't get into the proxy paths, so nobody is going to fix this
up.

Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, [the lock ordering prevents us from
getting the task_rq_lock() and] spoils things."

Thus we need more than just a binary concept of the task being
blocked on a mutex or not.

So allow setting blocked_on to PROXY_WAKING as a special value
which specifies the task is no longer blocked, but needs to
be evaluated for return migration *before* it can be run.

This will then be used in a later patch to handle proxy
return-migration.

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v15:
* Split blocked_on_state into its own patch later in the
  series, as the tri-state isn't necessary until we deal
  with proxy/return migrations
v16:
* Handle case where task in the chain is being set as
  BO_WAKING by another cpu (usually via ww_mutex die code).
  Make sure we release the rq lock so the wakeup can
  complete.
* Rework to use guard() in find_proxy_task() as suggested
  by Peter
v18:
* Add initialization of blocked_on_state for init_task
v19:
* PREEMPT_RT build fixups and rework suggested by
  K Prateek Nayak
v20:
* Simplify one of the blocked_on_state changes to avoid extra
  PREEMPT_RT conditionals
v21:
* Slight reworks due to avoiding nested blocked_lock locking
* Be consistent in use of blocked_on_state helper functions
* Rework calls to proxy_deactivate() to do proper locking
  around blocked_on_state changes that we were cheating in
  previous versions.
* Minor cleanups, some comment improvements
v22:
* Re-order blocked_on_state helpers to try to make it clearer
  the set_task_blocked_on() and clear_task_blocked_on() are
  the main enter/exit states and the blocked_on_state helpers
  help manage the transition states within. Per feedback from
  K Prateek Nayak.
* Rework blocked_on_state to be defined within
  CONFIG_SCHED_PROXY_EXEC as suggested by K Prateek Nayak.
* Reworked empty stub functions to just take one line as
  suggested by K Prateek
* Avoid using gotos out of a guard() scope, as highlighted by
  K Prateek, and instead rework logic to break and switch()
  on an action value.
v23:
* Big rework to using PROXY_WAKING instead of blocked_on_state
  as suggested by Peter.
* Reworked commit message to include Peter's nice diagrams and
  example for why this extra state is necessary.

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 include/linux/sched.h     | 51 +++++++++++++++++++++++++++++++++++++--
 kernel/locking/mutex.c    |  2 +-
 kernel/locking/ww_mutex.h | 16 ++++++------
 kernel/sched/core.c       | 16 ++++++++++++
 4 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f9924ec02c4f2..24b7b4a48ce03 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2180,10 +2180,20 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
 })
 
 #ifndef CONFIG_PREEMPT_RT
+
+/*
+ * With proxy exec, if a task has been proxy-migrated, it may be a donor
+ * on a cpu that it can't actually run on. Thus we need a special state
+ * to denote that the task is being woken, but that it needs to be
+ * evaluated for return-migration before it is run. So if the task is
+ * blocked_on PROXY_WAKING, return migrate it before running it.
+ */
+#define PROXY_WAKING ((struct mutex *)(-1L))
+
 static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
 {
 	lockdep_assert_held_once(&p->blocked_lock);
-	return p->blocked_on;
+	return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
 }
 
 static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2211,7 +2221,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
 	 * blocked_on relationships, but make sure we are not
 	 * clearing the relationship with a different lock.
 	 */
-	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+	WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
 	p->blocked_on = NULL;
 }
 
@@ -2220,6 +2230,35 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
 	guard(raw_spinlock_irqsave)(&p->blocked_lock);
 	__clear_task_blocked_on(p, m);
 }
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+	/* Currently we serialize blocked_on under the task::blocked_lock */
+	lockdep_assert_held_once(&p->blocked_lock);
+
+	if (!sched_proxy_exec()) {
+		__clear_task_blocked_on(p, m);
+		return;
+	}
+
+	/* Don't set PROXY_WAKING if blocked_on was already cleared */
+	if (!p->blocked_on)
+		return;
+	/*
+	 * There may be cases where we set PROXY_WAKING on tasks that were
+	 * already set to waking, but make sure we are not changing
+	 * the relationship with a different lock.
+	 */
+	WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+	p->blocked_on = PROXY_WAKING;
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+	guard(raw_spinlock_irqsave)(&p->blocked_lock);
+	__set_task_blocked_on_waking(p, m);
+}
+
 #else
 static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
@@ -2228,6 +2267,14 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mute
 static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
 {
 }
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
 #endif /* !CONFIG_PREEMPT_RT */
 
 static __always_inline bool need_resched(void)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4aa79bcab08c7..7d359647156df 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -983,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
 		next = waiter->task;
 
 		debug_mutex_wake_waiter(lock, waiter);
-		clear_task_blocked_on(next, lock);
+		set_task_blocked_on_waking(next, lock);
 		wake_q_add(&wake_q, next);
 	}
 
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e4a81790ea7dd..5cd9dfa4b31e6 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 		debug_mutex_wake_waiter(lock, waiter);
 #endif
 		/*
-		 * When waking up the task to die, be sure to clear the
-		 * blocked_on pointer. Otherwise we can see circular
-		 * blocked_on relationships that can't resolve.
+		 * When waking up the task to die, be sure to set the
+		 * blocked_on to PROXY_WAKING. Otherwise we can see
+		 * circular blocked_on relationships that can't resolve.
 		 */
-		clear_task_blocked_on(waiter->task, lock);
+		set_task_blocked_on_waking(waiter->task, lock);
 		wake_q_add(wake_q, waiter->task);
 	}
 
@@ -339,15 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
 		 */
 		if (owner != current) {
 			/*
-			 * When waking up the task to wound, be sure to clear the
-			 * blocked_on pointer. Otherwise we can see circular
-			 * blocked_on relationships that can't resolve.
+			 * When waking up the task to wound, be sure to set the
+			 * blocked_on to PROXY_WAKING. Otherwise we can see
+			 * circular blocked_on relationships that can't resolve.
 			 *
 			 * NOTE: We pass NULL here instead of lock, because we
 			 * are waking the mutex owner, who may be currently
 			 * blocked on a different mutex.
 			 */
-			clear_task_blocked_on(owner, NULL);
+			set_task_blocked_on_waking(owner, NULL);
 			wake_q_add(wake_q, owner);
 		}
 		return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0bb7272106c9e..7212a439124a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4242,6 +4242,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
+	/*
+	 * For now, if we've been woken up, clear the task->blocked_on
+	 * regardless if it was set to a mutex or PROXY_WAKING so the
+	 * task can run. We will need to be more careful later when
+	 * properly handling proxy migration
+	 */
+	clear_task_blocked_on(p, NULL);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6600,6 +6607,10 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 
 	/* Follow blocked_on chain. */
 	for (p = donor; (mutex = p->blocked_on); p = owner) {
+		/* if its PROXY_WAKING, resched_idle so ttwu can complete */
+		if (mutex == PROXY_WAKING)
+			return proxy_resched_idle(rq);
+
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
 		 * and ensure @owner sticks around.
@@ -6620,6 +6631,11 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 
 		owner = __mutex_owner(mutex);
 		if (!owner) {
+			/*
+			 * If there is no owner, clear blocked_on
+			 * and return p so it can run and try to
+			 * acquire the lock
+			 */
 			__clear_task_blocked_on(p, mutex);
 			return p;
 		}
-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v25 6/9] sched: Add assert_balance_callbacks_empty helper
  2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
                   ` (4 preceding siblings ...)
  2026-03-13  2:30 ` [PATCH v25 5/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
@ 2026-03-13  2:30 ` John Stultz
  2026-03-13  2:30 ` [PATCH v25 7/9] sched: Add logic to zap balance callbacks if we pick again John Stultz
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Peter Zijlstra, K Prateek Nayak, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

With proxy-exec utilizing pick-again logic, we can end up having
balance callbacks set by the previous pick_next_task() call left
on the list.

So pull the warning out into a helper function, and make sure we
check it when we pick again.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v24:
* Use IS_ENABLED() as suggested by K Prateek

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c  | 1 +
 kernel/sched/sched.h | 9 ++++++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7212a439124a9..ec9e8fe39f9fc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6853,6 +6853,7 @@ static void __sched notrace __schedule(int sched_mode)
 	}
 
 pick_again:
+	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
 	rq->next_class = next->sched_class;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 43bbf0693cca4..2a0236d745832 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1853,6 +1853,13 @@ static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
 static inline void scx_rq_clock_invalidate(struct rq *rq) {}
 #endif /* !CONFIG_SCHED_CLASS_EXT */
 
+static inline void assert_balance_callbacks_empty(struct rq *rq)
+{
+	WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_LOCKING) &&
+		     rq->balance_callback &&
+		     rq->balance_callback != &balance_push_callback);
+}
+
 /*
  * Lockdep annotation that avoids accidental unlocks; it's like a
  * sticky/continuous lockdep_assert_held().
@@ -1869,7 +1876,7 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
 
 	rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
 	rf->clock_update_flags = 0;
-	WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback);
+	assert_balance_callbacks_empty(rq);
 }
 
 static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v25 7/9] sched: Add logic to zap balance callbacks if we pick again
  2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
                   ` (5 preceding siblings ...)
  2026-03-13  2:30 ` [PATCH v25 6/9] sched: Add assert_balance_callbacks_empty helper John Stultz
@ 2026-03-13  2:30 ` John Stultz
  2026-03-13  2:30 ` [PATCH v25 8/9] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
  2026-03-13  2:30 ` [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
  8 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

With proxy-exec, a task is selected to run via pick_next_task(),
and then, if it is a mutex-blocked task, we call find_proxy_task()
to find a runnable owner. If the runnable owner is on another
cpu, we will need to migrate the selected donor task away, after
which we pick again and call pick_next_task() to choose
something else.

However, in the first call to pick_next_task(), we may have
had a balance_callback set up by the class scheduler. After we
pick again, it's possible pick_next_task_fair() will be called,
which calls sched_balance_newidle() and sched_balance_rq().

This will throw a warning:
[    8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[    8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[    8.796467] Call Trace:
[    8.796467]  <TASK>
[    8.796467]  ? __warn.cold+0xb2/0x14e
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  ? report_bug+0x107/0x1a0
[    8.796467]  ? handle_bug+0x54/0x90
[    8.796467]  ? exc_invalid_op+0x17/0x70
[    8.796467]  ? asm_exc_invalid_op+0x1a/0x20
[    8.796467]  ? sched_balance_rq+0xe92/0x1250
[    8.796467]  sched_balance_newidle+0x295/0x820
[    8.796467]  pick_next_task_fair+0x51/0x3f0
[    8.796467]  __schedule+0x23a/0x14b0
[    8.796467]  ? lock_release+0x16d/0x2e0
[    8.796467]  schedule+0x3d/0x150
[    8.796467]  worker_thread+0xb5/0x350
[    8.796467]  ? __pfx_worker_thread+0x10/0x10
[    8.796467]  kthread+0xee/0x120
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork+0x31/0x50
[    8.796467]  ? __pfx_kthread+0x10/0x10
[    8.796467]  ret_from_fork_asm+0x1a/0x30
[    8.796467]  </TASK>

This is because if a RT task was originally picked, it will
setup the rq->balance_callback with push_rt_tasks() via
set_next_task_rt().

Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not
in the same state as it was the first time pick_next_task was
called.
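
For reference, the RT class queues that callback roughly like so
(paraphrased from rt.c; details may vary):

	static inline void rt_queue_push_tasks(struct rq *rq)
	{
		if (!has_pushable_tasks(rq))
			return;

		queue_balance_callback(rq, &per_cpu(rt_push_head, rq->cpu),
				       push_rt_tasks);
	}

That callback stays queued on rq->balance_callback even though the
task we end up picking the second time around may belong to a
different sched class entirely.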

To handle this, add a zap_balance_callbacks() helper function
which cleans up the balance callbacks without running them. This
should be ok, as we are effectively undoing the state set in
the first call to pick_next_task(), and when we pick again,
the new callback can be configured for the donor task actually
selected.

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v20:
* Tweaked to avoid build issues with different configs
v22:
* Spelling fix suggested by K Prateek
* Collapsed the stub implementation to one line as suggested
  by K Prateek
* Zap callbacks when we resched idle, as suggested by K Prateek
v24:
* Don't conditionalize function on CONFIG_SCHED_PROXY_EXEC as
  the callers will be optimized out if that is unset, and the
  dead function will be removed, as suggested by K Prateek

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec9e8fe39f9fc..af497b8c72dce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4917,6 +4917,34 @@ static inline void finish_task(struct task_struct *prev)
 	smp_store_release(&prev->on_cpu, 0);
 }
 
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+	struct balance_callback *next, *head;
+	bool found = false;
+
+	lockdep_assert_rq_held(rq);
+
+	head = rq->balance_callback;
+	while (head) {
+		if (head == &balance_push_callback)
+			found = true;
+		next = head->next;
+		head->next = NULL;
+		head = next;
+	}
+	rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+
 static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
 {
 	void (*func)(struct rq *rq);
@@ -6860,10 +6888,14 @@ static void __sched notrace __schedule(int sched_mode)
 	if (sched_proxy_exec()) {
 		if (unlikely(next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
-			if (!next)
+			if (!next) {
+				zap_balance_callbacks(rq);
 				goto pick_again;
-			if (next == rq->idle)
+			}
+			if (next == rq->idle) {
+				zap_balance_callbacks(rq);
 				goto keep_resched;
+			}
 		}
 	}
 picked:
-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v25 8/9] sched: Move attach_one_task and attach_task helpers to sched.h
  2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
                   ` (6 preceding siblings ...)
  2026-03-13  2:30 ` [PATCH v25 7/9] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2026-03-13  2:30 ` John Stultz
  2026-03-15 16:34   ` K Prateek Nayak
  2026-03-13  2:30 ` [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
  8 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
	Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

The fair scheduler locally introduced attach_one_task() and
attach_task() helpers, but these could be generically useful,
so move them to sched.h so we can use them elsewhere.
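
For context, the expected use later in the series is roughly the
following (an illustrative sketch, not the literal caller):

	/* with this_rq locked and @p chosen for migration: */
	deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
	set_task_cpu(p, target_cpu);
	raw_spin_rq_unlock(this_rq);

	/* attach_one_task() takes the target rq lock and activates @p there */
	attach_one_task(cpu_rq(target_cpu), p);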

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/fair.c  | 26 --------------------------
 kernel/sched/sched.h | 26 ++++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed1..53da01a251487 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9784,32 +9784,6 @@ static int detach_tasks(struct lb_env *env)
 	return detached;
 }
 
-/*
- * attach_task() -- attach the task detached by detach_task() to its new rq.
- */
-static void attach_task(struct rq *rq, struct task_struct *p)
-{
-	lockdep_assert_rq_held(rq);
-
-	WARN_ON_ONCE(task_rq(p) != rq);
-	activate_task(rq, p, ENQUEUE_NOCLOCK);
-	wakeup_preempt(rq, p, 0);
-}
-
-/*
- * attach_one_task() -- attaches the task returned from detach_one_task() to
- * its new rq.
- */
-static void attach_one_task(struct rq *rq, struct task_struct *p)
-{
-	struct rq_flags rf;
-
-	rq_lock(rq, &rf);
-	update_rq_clock(rq);
-	attach_task(rq, p);
-	rq_unlock(rq, &rf);
-}
-
 /*
  * attach_tasks() -- attaches all tasks detached by detach_tasks() to their
  * new rq.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a0236d745832..d4a73c3db03d4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3008,6 +3008,32 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
 
 extern void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags);
 
+/*
+ * attach_task() -- attach the task detached by detach_task() to its new rq.
+ */
+static inline void attach_task(struct rq *rq, struct task_struct *p)
+{
+	lockdep_assert_rq_held(rq);
+
+	WARN_ON_ONCE(task_rq(p) != rq);
+	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	wakeup_preempt(rq, p, 0);
+}
+
+/*
+ * attach_one_task() -- attaches the task returned from detach_one_task() to
+ * its new rq.
+ */
+static inline void attach_one_task(struct rq *rq, struct task_struct *p)
+{
+	struct rq_flags rf;
+
+	rq_lock(rq, &rf);
+	update_rq_clock(rq);
+	attach_task(rq, p);
+	rq_unlock(rq, &rf);
+}
+
 #ifdef CONFIG_PREEMPT_RT
 # define SCHED_NR_MIGRATE_BREAK 8
 #else
-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
  2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
                   ` (7 preceding siblings ...)
  2026-03-13  2:30 ` [PATCH v25 8/9] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
@ 2026-03-13  2:30 ` John Stultz
  2026-03-15 17:38   ` K Prateek Nayak
                     ` (3 more replies)
  8 siblings, 4 replies; 38+ messages in thread
From: John Stultz @ 2026-03-13  2:30 UTC (permalink / raw)
  To: LKML
  Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
	Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
	hupu, kernel-team

Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.

Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.

Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.
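
At a high level, the chain walk in find_proxy_task() now resolves to
one of a few outcomes (a rough summary of the logic below, not literal
code):

	owner runnable on this rq      -> run the owner as the proxy (FOUND)
	owner sleeping/delayed-dequeue -> deactivate the donor, or force a
	                                  return migration if that fails
	owner runnable on another cpu  -> proxy_migrate_task() the donor to
	                                  the owner's cpu (MIGRATE)
	PROXY_WAKING or ownerless lock -> return-migrate the waiter back to
	                                  a cpu it can run on (NEEDS_RETURN)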

This was split out from the larger proxy patch, and
significantly reworked.

Credits for the original patch go to:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Signed-off-by: John Stultz <jstultz@google.com>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migrate to migrate the deeper task in the chain, to
  avoid tasks ping-ponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo
  fixes suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which
  avoids nasty lock juggling and performance issues
* Fixes for UP builds
v8:
* More simplifications from Metin Kaya
* Fixes for null owner case, including doing return migration
* Cleanup proxy_needs_return logic
v9:
* Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed
  return migrations
* Switch to using zap_balance_callbacks rather than running
  them when we are dropping rq locks for proxy_migration.
* Drop task_is_blocked check in sched_submit_work as suggested
  by Metin (may re-add later if this causes trouble)
* Do return migration when we're not on wake_cpu. This avoids
  bad task placement caused by proxy migrations raised by
  Xuewen Yan
* Fix to call set_next_task(rq->curr) prior to dropping rq lock
  to avoid rq->curr getting migrated before we have actually
  switched from it
* Cleanup to re-use proxy_resched_idle() instead of open coding
  it in proxy_migrate_task()
* Fix return migration not to use DEQUEUE_SLEEP, so that we
  properly see the task as task_on_rq_migrating() after it is
  dequeued but before set_task_cpu() has been called on it
* Fix to broaden find_proxy_task() checks to avoid race where
  a task is dequeued off the rq due to return migration, but
  set_task_cpu() and the enqueue on another rq happened after
  we checked task_cpu(owner). This ensures we don't proxy
  using a task that is not actually on our runqueue.
* Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition
  in try_to_wake_up() if proxy execution isn't enabled.
* Cleanup to improve comment in proxy_migrate_task() explaining
  the set_next_task(rq->curr) logic
* Cleanup deadline.c change to stylistically match rt.c change
* Numerous cleanups suggested by Metin
v10:
* Drop WARN_ON(task_is_blocked(p)) in ttwu current case
v11:
* Include proxy_set_task_cpu from later in the series to this
  change so we can use it, rather than reworking logic later
  in the series.
* Fix problem with return migration, where affinity was changed
  and wake_cpu was left outside the affinity mask.
* Avoid reading the owner's cpu twice (as it might change in between)
  to avoid occasional migration-to-same-cpu edge cases
* Add extra WARN_ON checks for wake_cpu and return migration
  edge cases.
* Typo fix from Metin
v13:
* As we set ret, return it, not just NULL (pulling this change
  in from later patch)
* Avoid deadlock between try_to_wake_up() and find_proxy_task() when
  blocked_on cycle with ww_mutex is trying a mid-chain wakeup.
* Tweaks to use new __set_blocked_on_runnable() helper
* Potential fix for incorrectly updated task->dl_server issues
* Minor comment improvements
* Add logic to handle missed wakeups, in that case doing return
  migration from the find_proxy_task() path
* Minor cleanups
v14:
* Improve edge cases where we wouldn't set the task as BO_RUNNABLE
v15:
* Added comment to better describe proxy_needs_return() as suggested
  by Qais
* Build fixes for !CONFIG_SMP reported by
  Maciej Żenczykowski <maze@google.com>
* Adds fix for re-evaluating proxy_needs_return when
  sched_proxy_exec() is disabled, reported and diagnosed by:
  kuyo chang <kuyo.chang@mediatek.com>
v16:
* Larger rework of needs_return logic in find_proxy_task, in
  order to avoid problems with cpuhotplug
* Rework to use guard() as suggested by Peter
v18:
* Integrate optimization suggested by Suleiman to do the checks
  for sleeping owners before checking if the task_cpu is this_cpu,
  so that we can avoid needlessly proxy-migrating tasks to only
  then dequeue them. Also check if migrating last.
* Improve comments around guard locking
* Include tweak to ttwu_runnable() as suggested by
  hupu <hupu.gm@gmail.com>
* Rework the logic releasing the rq->donor reference before letting
  go of the rqlock. Just use rq->idle.
* Go back to doing return migration on BO_WAKING owners, as I was
  hitting some softlockups caused by running tasks not making
  it out of BO_WAKING.
v19:
* Fixed proxy_force_return() logic for !SMP cases
v21:
* Reworked donor deactivation for unhandled sleeping owners
* Commit message tweaks
v22:
* Add comments around zap_balance_callbacks in proxy_migration logic
* Rework logic to avoid gotos out of guard() scopes, and instead
  use break and switch() on action value, as suggested by K Prateek
* K Prateek suggested simplifications around putting donor and
  setting idle as next task in the migration paths, which I further
  simplified to using proxy_resched_idle()
* Comment improvements
* Dropped curr != donor check in pick_next_task_fair() suggested by
  K Prateek
v23:
* Rework to use the PROXY_WAKING approach suggested by Peter
* Drop unnecessarily setting wake_cpu when affinity changes
  as noticed by Peter
* Split out the ttwu() logic changes into a later separate patch
  as suggested by Peter
v24:
* Numerous fixes for rq clock handling, pointed out by K Prateek
* Slight tweak to where put_task() is called suggested by K Prateek
v25:
* Use WF_TTWU in proxy_force_return(), suggested by K Prateek
* Drop get/put_task_struct() in proxy_force_return(), suggested by
  K Prateek
* Use attach_one_task() to reduce repetitive logic, as suggested
  by K Prateek

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c | 221 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 191 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af497b8c72dce..fe20204cf51cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
 	rq->idle_stamp = 0;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+	unsigned int wake_cpu;
+
+	/*
+	 * Since we are enqueuing a blocked task on a cpu it may
+	 * not be able to run on, preserve wake_cpu when we
+	 * __set_task_cpu so we can return the task to where it
+	 * was previously runnable.
+	 */
+	wake_cpu = p->wake_cpu;
+	__set_task_cpu(p, cpu);
+	p->wake_cpu = wake_cpu;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
 static void
 ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 		 struct rq_flags *rf)
@@ -4242,13 +4259,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		ttwu_queue(p, cpu, wake_flags);
 	}
 out:
-	/*
-	 * For now, if we've been woken up, clear the task->blocked_on
-	 * regardless if it was set to a mutex or PROXY_WAKING so the
-	 * task can run. We will need to be more careful later when
-	 * properly handling proxy migration
-	 */
-	clear_task_blocked_on(p, NULL);
 	if (success)
 		ttwu_stat(p, task_cpu(p), wake_flags);
 
@@ -6575,7 +6585,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
 	return rq->idle;
 }
 
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
 {
 	unsigned long state = READ_ONCE(donor->__state);
 
@@ -6595,17 +6605,135 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
 	return try_to_block_task(rq, donor, &state, true);
 }
 
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p, int target_cpu)
 {
-	if (!__proxy_deactivate(rq, donor)) {
-		/*
-		 * XXX: For now, if deactivation failed, set donor
-		 * as unblocked, as we aren't doing proxy-migrations
-		 * yet (more logic will be needed then).
-		 */
-		clear_task_blocked_on(donor, NULL);
+	struct rq *target_rq = cpu_rq(target_cpu);
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
+	 * otherwise we have a reference that no longer belongs to us.
+	 *
+	 * Additionally, as we put_prev_task(prev) earlier, it's possible that
+	 * prev will migrate away as soon as we drop the rq lock, however we
+	 * still have it marked as rq->curr, as we've not yet switched tasks.
+	 *
+	 * So call proxy_resched_idle() to let go of the references before
+	 * we release the lock.
+	 */
+	proxy_resched_idle(rq);
+
+	WARN_ON(p == rq->curr);
+
+	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+	proxy_set_task_cpu(p, target_cpu);
+
+	/*
+	 * We have to zap callbacks before unlocking the rq
+	 * as another CPU may jump in and call sched_balance_rq
+	 * which can trip the warning in rq_pin_lock() if we
+	 * leave callbacks set.
+	 */
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+
+	attach_one_task(target_rq, p);
+
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+	update_rq_clock(rq);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+			       struct task_struct *p)
+{
+	struct rq *this_rq, *target_rq;
+	struct rq_flags this_rf;
+	int cpu, wake_flag = WF_TTWU;
+
+	lockdep_assert_rq_held(rq);
+	WARN_ON(p == rq->curr);
+
+	/*
+	 * We have to zap callbacks before unlocking the rq
+	 * as another CPU may jump in and call sched_balance_rq
+	 * which can trip the warning in rq_pin_lock() if we
+	 * leave callbacks set.
+	 */
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+
+	/*
+	 * We drop the rq lock, and re-grab task_rq_lock to get
+	 * the pi_lock (needed for select_task_rq) as well.
+	 */
+	this_rq = task_rq_lock(p, &this_rf);
+
+	/*
+	 * Since we let go of the rq lock, the task may have been
+	 * woken or migrated to another rq before we got the
+	 * task_rq_lock. So re-check we're on the same RQ. If
+	 * not, the task has already been migrated and that CPU
+	 * will handle any further migrations.
+	 */
+	if (this_rq != rq)
+		goto err_out;
+
+	/* Similarly, if we've been dequeued, someone else will wake us */
+	if (!task_on_rq_queued(p))
+		goto err_out;
+
+	/*
+	 * Since we should only be calling here from __schedule()
+	 * -> find_proxy_task(), no one else should have
+	 * assigned current out from under us. But check and warn
+	 * if we see this, then bail.
+	 */
+	if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
+		WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
+			  __func__, cpu_of(this_rq),
+			  p->comm, p->pid, p->on_cpu);
+		goto err_out;
 	}
-	return NULL;
+
+	update_rq_clock(this_rq);
+	proxy_resched_idle(this_rq);
+	deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
+	cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+	set_task_cpu(p, cpu);
+	target_rq = cpu_rq(cpu);
+	clear_task_blocked_on(p, NULL);
+	task_rq_unlock(this_rq, p, &this_rf);
+
+	attach_one_task(target_rq, p);
+
+	/* Finally, re-grab the original rq lock and return to pick-again */
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+	update_rq_clock(rq);
+	return;
+
+err_out:
+	task_rq_unlock(this_rq, p, &this_rf);
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+	update_rq_clock(rq);
 }
 
 /*
@@ -6627,17 +6755,25 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 {
-	enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
+	enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
 	struct task_struct *owner = NULL;
+	bool curr_in_chain = false;
 	int this_cpu = cpu_of(rq);
 	struct task_struct *p;
 	struct mutex *mutex;
+	int owner_cpu;
 
 	/* Follow blocked_on chain. */
 	for (p = donor; (mutex = p->blocked_on); p = owner) {
-		/* if its PROXY_WAKING, resched_idle so ttwu can complete */
-		if (mutex == PROXY_WAKING)
-			return proxy_resched_idle(rq);
+		/* if it's PROXY_WAKING, do return migration or run if current */
+		if (mutex == PROXY_WAKING) {
+			if (task_current(rq, p)) {
+				clear_task_blocked_on(p, PROXY_WAKING);
+				return p;
+			}
+			action = NEEDS_RETURN;
+			break;
+		}
 
 		/*
 		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
@@ -6657,26 +6793,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 			return NULL;
 		}
 
+		if (task_current(rq, p))
+			curr_in_chain = true;
+
 		owner = __mutex_owner(mutex);
 		if (!owner) {
 			/*
-			 * If there is no owner, clear blocked_on
-			 * and return p so it can run and try to
-			 * acquire the lock
+			 * If there is no owner, either clear blocked_on
+			 * and return p (if it is current and safe to
+			 * just run on this rq), or return-migrate the task.
 			 */
-			__clear_task_blocked_on(p, mutex);
-			return p;
+			if (task_current(rq, p)) {
+				__clear_task_blocked_on(p, NULL);
+				return p;
+			}
+			action = NEEDS_RETURN;
+			break;
 		}
 
 		if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
 			/* XXX Don't handle blocked owners/delayed dequeue yet */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
 			action = DEACTIVATE_DONOR;
 			break;
 		}
 
-		if (task_cpu(owner) != this_cpu) {
-			/* XXX Don't handle migrations yet */
-			action = DEACTIVATE_DONOR;
+		owner_cpu = task_cpu(owner);
+		if (owner_cpu != this_cpu) {
+			/*
+			 * @owner can disappear, simply migrate to @owner_cpu
+			 * and leave that CPU to sort things out.
+			 */
+			if (curr_in_chain)
+				return proxy_resched_idle(rq);
+			action = MIGRATE;
 			break;
 		}
 
@@ -6738,7 +6889,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
 	/* Handle actions we need to do outside of the guard() scope */
 	switch (action) {
 	case DEACTIVATE_DONOR:
-		return proxy_deactivate(rq, donor);
+		if (proxy_deactivate(rq, donor))
+			return NULL;
+		/* If deactivate fails, force return */
+		p = donor;
+		fallthrough;
+	case NEEDS_RETURN:
+		proxy_force_return(rq, rf, p);
+		return NULL;
+	case MIGRATE:
+		proxy_migrate_task(rq, rf, p, owner_cpu);
+		return NULL;
 	case FOUND:
 		/* fallthrough */;
 	}
-- 
2.53.0.880.g73c4285caa-goog


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-13  2:30 ` [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
@ 2026-03-13 13:48   ` Juri Lelli
  2026-03-13 17:53     ` John Stultz
  2026-03-15 16:26   ` K Prateek Nayak
  1 sibling, 1 reply; 38+ messages in thread
From: Juri Lelli @ 2026-03-13 13:48 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, K Prateek Nayak, Peter Zijlstra, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello,

On 13/03/26 02:30, John Stultz wrote:

...

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index d08b004293234..4e746f4de6529 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -2801,12 +2801,24 @@ static int find_later_rq(struct task_struct *task)
>  
>  static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
>  {
> -	struct task_struct *p;
> +	struct task_struct *p = NULL;
> +	struct rb_node *next_node;
>  
>  	if (!has_pushable_dl_tasks(rq))
>  		return NULL;
>  
> -	p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
> +	next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
> +	while (next_node) {
> +		p = __node_2_pdl(next_node);
> +		/* make sure task isn't on_cpu (possible with proxy-exec) */
> +		if (!task_on_cpu(rq, p))
> +			break;
> +
> +		next_node = rb_next(next_node);
> +	}
> +
> +	if (!p)
> +		return NULL;

Can't this return an on_cpu task if we hit the corner case where all
pushable tasks are on_cpu?

Thanks,
Juri


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-13 13:48   ` Juri Lelli
@ 2026-03-13 17:53     ` John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-13 17:53 UTC (permalink / raw)
  To: Juri Lelli
  Cc: LKML, K Prateek Nayak, Peter Zijlstra, Joel Fernandes,
	Qais Yousef, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Fri, Mar 13, 2026 at 6:48 AM Juri Lelli <juri.lelli@redhat.com> wrote:
>
> Hello,
>
> On 13/03/26 02:30, John Stultz wrote:
>
> ...
>
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index d08b004293234..4e746f4de6529 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -2801,12 +2801,24 @@ static int find_later_rq(struct task_struct *task)
> >
> >  static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
> >  {
> > -     struct task_struct *p;
> > +     struct task_struct *p = NULL;
> > +     struct rb_node *next_node;
> >
> >       if (!has_pushable_dl_tasks(rq))
> >               return NULL;
> >
> > -     p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
> > +     next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
> > +     while (next_node) {
> > +             p = __node_2_pdl(next_node);
> > +             /* make sure task isn't on_cpu (possible with proxy-exec) */
> > +             if (!task_on_cpu(rq, p))
> > +                     break;
> > +
> > +             next_node = rb_next(next_node);
> > +     }
> > +
> > +     if (!p)
> > +             return NULL;
>
> Can't this return an on_cpu task if we hit the corner case where all
> pushable tasks are on_cpu?
>

Oof. Yep. Thanks for catching this!

Let me know if you have any feedback on the rest of the series. I'll
try to give reviewers a chance to catch anything else, and will respin
this next week.

Really appreciate the review!
thanks
-john

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-13  2:30 ` [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
  2026-03-13 13:48   ` Juri Lelli
@ 2026-03-15 16:26   ` K Prateek Nayak
  2026-03-17  4:49     ` John Stultz
  1 sibling, 1 reply; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-15 16:26 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 3/13/2026 8:00 AM, John Stultz wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b7f77c165a6e0..d86d648a75a4b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6702,23 +6702,6 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  }
>  #endif /* SCHED_PROXY_EXEC */
>  
> -static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
> -{
> -	if (!sched_proxy_exec())
> -		return;
> -	/*
> -	 * pick_next_task() calls set_next_task() on the chosen task
> -	 * at some point, which ensures it is not push/pullable.
> -	 * However, the chosen/donor task *and* the mutex owner form an
> -	 * atomic pair wrt push/pull.
> -	 *
> -	 * Make sure owner we run is not pushable. Unfortunately we can
> -	 * only deal with that by means of a dequeue/enqueue cycle. :-/
> -	 */
> -	dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
> -	enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
> -}
> -
>  /*
>   * __schedule() is the main scheduler function.
>   *
> @@ -6871,9 +6854,6 @@ static void __sched notrace __schedule(int sched_mode)
>  		 */
>  		RCU_INIT_POINTER(rq->curr, next);
>  
> -		if (!task_current_donor(rq, next))
> -			proxy_tag_curr(rq, next);
> -

Back to my concern with the queuing of the balance_callback, and the
deadline and RT folks can keep me honest here, consider the following:

    CPU0
    ====

  ======> Task A (prio: 80)
  ...
  
  mutex_lock(Mutex0)
  ... /* Executing critical section. */

    =====> Interrupt: Wakes up Task B (prio: 50); B->blocked_on = Mutex0;
      resched_curr()
    <===== Interrupt return
  preempt_schedule_irq()
    schedule()
      put_prev_set_next_task(A, B)
      rq->donor = B
      if (task_is_blocked(B))
        next = find_proxy_task() /* Return Task A */
      rq->curr = A
      queue_balance_callback()
    do_balance_callbacks()
      /* Finds A as task_on_cpu(); Does nothing. */

  ... /* returns from schedule */
  ... /* continues with critical section */

  mutex_unlock(Mutex0)
    mutex_handoff(B /* Task B */)
    preempt_disable()
      try_to_wake_up()
        resched_curr()
    preempt_enable()
      preempt_schedule()
        proxy_force_return()
          /* Returns to same CPU */

        /*
         * put_prev_set_next_task() is skipped since
         * rq->donor context is same. no balance
         * callbacks are queued. Task A still on the
         * push list.
         */
        rq->donor = B
        rq->curr = B

  =======> sched_out: Task A

  !!! No balance callback; Task A still on push list. !!!
  
  <======= sched_in: Task B


So what I'm getting to is, if we find that rq->donor has not changed
with sched_proxy_exec() but rq->curr has changed during schedule(), we
should forcefully do a:

  prev->sched_class->put_prev_task(rq, rq->donor, rq->donor /* or rq->idle / NULL ? */);
  next->sched_class->set_next_task(rq, rq->donor, true /* to queue balance callback. */);

That way, when we do set_nex_task(), we see if we potentially have
tasks in the push list and queue a balance callback since the
task_on_cpu() condition may no longer apply to the tasks left behind
on the list.

Thoughts?

>  		/*
>  		 * The membarrier system call requires each architecture
>  		 * to have a full memory barrier after updating
> @@ -6907,10 +6887,6 @@ static void __sched notrace __schedule(int sched_mode)
>  		/* Also unlocks the rq: */
>  		rq = context_switch(rq, prev, next, &rf);
>  	} else {
> -		/* In case next was already curr but just got blocked_donor */
> -		if (!task_current_donor(rq, next))
> -			proxy_tag_curr(rq, next);
> -
>  		rq_unpin_lock(rq, &rf);
>  		__balance_callbacks(rq, NULL);
>  		raw_spin_rq_unlock_irq(rq);

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 8/9] sched: Move attach_one_task and attach_task helpers to sched.h
  2026-03-13  2:30 ` [PATCH v25 8/9] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
@ 2026-03-15 16:34   ` K Prateek Nayak
  2026-03-16 23:34     ` John Stultz
  0 siblings, 1 reply; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-15 16:34 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 3/13/2026 8:00 AM, John Stultz wrote:
> +/*
> + * attach_one_task() -- attaches the task returned from detach_one_task() to
> + * its new rq.
> + */
> +static inline void attach_one_task(struct rq *rq, struct task_struct *p)
> +{
> +	struct rq_flags rf;
> +
> +	rq_lock(rq, &rf);

nit. We can now use guard(rq_lock)(rq) and save on needing to declare a
"rf". Apart from that, feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

> +	update_rq_clock(rq);
> +	attach_task(rq, p);
> +	rq_unlock(rq, &rf);
> +}
> +
-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 2/9] sched: Minimise repeated sched_proxy_exec() checking
  2026-03-13  2:30 ` [PATCH v25 2/9] sched: Minimise repeated sched_proxy_exec() checking John Stultz
@ 2026-03-15 17:01   ` K Prateek Nayak
  0 siblings, 0 replies; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-15 17:01 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 3/13/2026 8:00 AM, John Stultz wrote:
> Peter noted: Compilers are really bad (as in they utterly refuse)
> optimizing (even when marked with __pure) the static branch
> things, and will happily emit multiple identical in a row.
> 
> So pull out the one obvious sched_proxy_exec() branch in
> __schedule() and remove some of the 'implicit' ones in that
> path.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: John Stultz <jstultz@google.com>

Feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d86d648a75a4b..84c61496fa263 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6597,11 +6597,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>  	struct mutex *mutex;
>  
>  	/* Follow blocked_on chain. */
> -	for (p = donor; task_is_blocked(p); p = owner) {
> -		mutex = p->blocked_on;
> -		/* Something changed in the chain, so pick again */
> -		if (!mutex)
> -			return NULL;

Previously we used to return NULL here when "p->blocked_on" turned NULL
between the check in the loop condition and the read here but from my
analysis on v24, with PROXY_WAKING, "p->blocked_on" can never transition
to NULL with rq_lock() held for a queued task (it can only transition
from a valid mutex to PROXY_WAKING outside the rq_lock, both of which
are not NULL) so we should never hit this condition now with the new
state and this should be safe :-)


> +	for (p = donor; (mutex = p->blocked_on); p = owner) {
>  		/*
>  		 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
>  		 * and ensure @owner sticks around.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
  2026-03-13  2:30 ` [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2026-03-15 17:38   ` K Prateek Nayak
  2026-03-18 19:07     ` John Stultz
  2026-03-18  6:35   ` Juri Lelli
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-15 17:38 UTC (permalink / raw)
  To: John Stultz, LKML
  Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 3/13/2026 8:00 AM, John Stultz wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index af497b8c72dce..fe20204cf51cc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
>  	rq->idle_stamp = 0;
>  }
>  
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> +{
> +	unsigned int wake_cpu;
> +
> +	/*
> +	 * Since we are enqueuing a blocked task on a cpu it may
> +	 * not be able to run on, preserve wake_cpu when we
> +	 * __set_task_cpu so we can return the task to where it
> +	 * was previously runnable.
> +	 */
> +	wake_cpu = p->wake_cpu;
> +	__set_task_cpu(p, cpu);
> +	p->wake_cpu = wake_cpu;
> +}
> +#endif /* CONFIG_SCHED_PROXY_EXEC */
> +
>  static void
>  ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>  		 struct rq_flags *rf)
> @@ -4242,13 +4259,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  		ttwu_queue(p, cpu, wake_flags);
>  	}
>  out:
> -	/*
> -	 * For now, if we've been woken up, clear the task->blocked_on
> -	 * regardless if it was set to a mutex or PROXY_WAKING so the
> -	 * task can run. We will need to be more careful later when
> -	 * properly handling proxy migration
> -	 */
> -	clear_task_blocked_on(p, NULL);

So, for this bit, there are mutex variants that are interruptible and
killable which probably benefit from clearing the blocked_on
relation.

For potential proxy tasks that are still queued, we'll hit the
ttwu_runnable() path and resched out of there, so it makes sense to
mark them as PROXY_WAKING so schedule() can return-migrate them; they
run, hit the signal_pending_state() check in the __mutex_lock_common()
loop, and return -EINTR.

Otherwise, if they need a full wakeup, they may be blocked on a
sleeping owner, in which case it is beneficial to clear blocked_on, do
a full wakeup, and let them run to evaluate the pending signal.

ttwu_state_match() should filter out any spurious signals. Thoughts? 

>  	if (success)
>  		ttwu_stat(p, task_cpu(p), wake_flags);
>  
> @@ -6575,7 +6585,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
>  	return rq->idle;
>  }
>  
> -static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
>  {
>  	unsigned long state = READ_ONCE(donor->__state);
>  
> @@ -6595,17 +6605,135 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
>  	return try_to_block_task(rq, donor, &state, true);
>  }
>  
> -static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p, int target_cpu)
>  {
> -	if (!__proxy_deactivate(rq, donor)) {
> -		/*
> -		 * XXX: For now, if deactivation failed, set donor
> -		 * as unblocked, as we aren't doing proxy-migrations
> -		 * yet (more logic will be needed then).
> -		 */
> -		clear_task_blocked_on(donor, NULL);
> +	struct rq *target_rq = cpu_rq(target_cpu);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
> +	 * otherwise we have a reference that no longer belongs to us.
> +	 *
> +	 * Additionally, as we put_prev_task(prev) earlier, its possible that
> +	 * prev will migrate away as soon as we drop the rq lock, however we
> +	 * still have it marked as rq->curr, as we've not yet switched tasks.
> +	 *
> +	 * So call proxy_resched_idle() to let go of the references before
> +	 * we release the lock.
> +	 */
> +	proxy_resched_idle(rq);
> +
> +	WARN_ON(p == rq->curr);
> +
> +	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> +	proxy_set_task_cpu(p, target_cpu);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */
> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	attach_one_task(target_rq, p);
> +
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	update_rq_clock(rq);
> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p)
> +{
> +	struct rq *this_rq, *target_rq;
> +	struct rq_flags this_rf;
> +	int cpu, wake_flag = WF_TTWU;
> +
> +	lockdep_assert_rq_held(rq);
> +	WARN_ON(p == rq->curr);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */
> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	/*
> +	 * We drop the rq lock, and re-grab task_rq_lock to get
> +	 * the pi_lock (needed for select_task_rq) as well.
> +	 */
> +	this_rq = task_rq_lock(p, &this_rf);
> +
> +	/*
> +	 * Since we let go of the rq lock, the task may have been
> +	 * woken or migrated to another rq before we  got the
> +	 * task_rq_lock. So re-check we're on the same RQ. If
> +	 * not, the task has already been migrated and that CPU
> +	 * will handle any futher migrations.
> +	 */
> +	if (this_rq != rq)
> +		goto err_out;
> +
> +	/* Similarly, if we've been dequeued, someone else will wake us */
> +	if (!task_on_rq_queued(p))
> +		goto err_out;
> +
> +	/*
> +	 * Since we should only be calling here from __schedule()
> +	 * -> find_proxy_task(), no one else should have
> +	 * assigned current out from under us. But check and warn
> +	 * if we see this, then bail.
> +	 */
> +	if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> +		WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
> +			  __func__, cpu_of(this_rq),
> +			  p->comm, p->pid, p->on_cpu);
> +		goto err_out;
>  	}
> -	return NULL;
> +
> +	update_rq_clock(this_rq);
> +	proxy_resched_idle(this_rq);

I still think this is too late, and only required if we are moving the
donor. Can we do this before we drop the rq_lock so that a remote
wakeup doesn't need to clear this? (although I think we don't have
that bit in the ttwu path anymore and we rely on the schedule() bits
completely for return migration on this version - any particular
reason?).

> +	deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
> +	cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> +	set_task_cpu(p, cpu);
> +	target_rq = cpu_rq(cpu);
> +	clear_task_blocked_on(p, NULL);
> +	task_rq_unlock(this_rq, p, &this_rf);
> +
> +	attach_one_task(target_rq, p);

I'm still having a hard time believing we cannot use wake_up_process()
but let me look more into that tomorrow when the sun rises.

> +
> +	/* Finally, re-grab the origianl rq lock and return to pick-again */
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	update_rq_clock(rq);
> +	return;
> +
> +err_out:
> +	task_rq_unlock(this_rq, p, &this_rf);
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	update_rq_clock(rq);
>  }
>  
>  /*
-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 8/9] sched: Move attach_one_task and attach_task helpers to sched.h
  2026-03-15 16:34   ` K Prateek Nayak
@ 2026-03-16 23:34     ` John Stultz
  2026-03-17  2:29       ` K Prateek Nayak
  0 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2026-03-16 23:34 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Sun, Mar 15, 2026 at 9:34 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/13/2026 8:00 AM, John Stultz wrote:
> > +/*
> > + * attach_one_task() -- attaches the task returned from detach_one_task() to
> > + * its new rq.
> > + */
> > +static inline void attach_one_task(struct rq *rq, struct task_struct *p)
> > +{
> > +     struct rq_flags rf;
> > +
> > +     rq_lock(rq, &rf);
>
> nit. We can now use guard(rq_lock)(rq) and save on needing to declare a
> "rf". Apart from that, feel free to include:

I actually did this in a later patch in the full series, as it seemed
more clear I wasn't modifying logic when moving the code:
  https://github.com/johnstultz-work/linux-dev/commit/8d8a12278d81ce81af6b0dfd051750f4ce2ec0e5

But, given your feedback, I'll go ahead and fold that fix down to this
change and add a note in the commit message.

Thanks as always for the review and feedback!
-john

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 8/9] sched: Move attach_one_task and attach_task helpers to sched.h
  2026-03-16 23:34     ` John Stultz
@ 2026-03-17  2:29       ` K Prateek Nayak
  0 siblings, 0 replies; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-17  2:29 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 3/17/2026 5:04 AM, John Stultz wrote:
> On Sun, Mar 15, 2026 at 9:34 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 3/13/2026 8:00 AM, John Stultz wrote:
>>> +/*
>>> + * attach_one_task() -- attaches the task returned from detach_one_task() to
>>> + * its new rq.
>>> + */
>>> +static inline void attach_one_task(struct rq *rq, struct task_struct *p)
>>> +{
>>> +     struct rq_flags rf;
>>> +
>>> +     rq_lock(rq, &rf);
>>
>> nit. We can now use guard(rq_lock)(rq) and save on needing to declare a
>> "rf". Apart from that, feel free to include:
> 
> I actually did this in a later patch in the full series, as it seemed
> more clear I wasn't modifying logic when moving the code:
>   https://github.com/johnstultz-work/linux-dev/commit/8d8a12278d81ce81af6b0dfd051750f4ce2ec0e5
> 
> But, given your feedback, I'll go ahead and fold that fix down to this
> change and add a note in the commit message.

Thanks a ton! That reminds me that I should go look at the full proxy
series once again. I've queued some usual benchmark runs for these bits
on top of tip - I'm hopeful I've sorted the stuff with longer running
benchmarks this time around; will report back once they are done.

Again, thank you for incorporating the suggestions and reworking the
series. Much appreciated _/\_

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-15 16:26   ` K Prateek Nayak
@ 2026-03-17  4:49     ` John Stultz
  2026-03-17  5:41       ` K Prateek Nayak
  0 siblings, 1 reply; 38+ messages in thread
From: John Stultz @ 2026-03-17  4:49 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Sun, Mar 15, 2026 at 9:27 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Back to my concern with the queuing of the balance_callback, and the
> deadline and RT folks can keep me honest here, consider the following:
>
>     CPU0
>     ====
>
>   ======> Task A (prio: 80)
>   ...
>
>   mutex_lock(Mutex0)
>   ... /* Executing critical section. */
>
>     =====> Interrupt: Wakes up Task B (prio: 50); B->blocked_on = Mutex0;
>       resched_curr()
>     <===== Interrupt return
>   preempt_schedule_irq()
>     schedule()
>       put_prev_set_next_Task(A, B)
>       rq->donor = B
>       if (task_is_blocked(B)
>         next = find_proxy_task() /* Return Task A */
>       rq->curr = A
>       queue_balance_callback()
>     do_balance_callbacks()
>       /* Finds A as task_on_cpu(); Does nothing. */
>
>   ... /* returns from schedule */
>   ... /* continues with critical section */
>
>   mutex_unlock(Mutex0)
>     mutex_handoff(B /* Task B */)
>     preempt_disable()
>       try_to_wake_up()
>         resched_curr()
>     preempt_enable()
>       preempt_schedule()
>         proxy_force_return()
>           /* Returns to same CPU */
>
>         /*
>          * put_prev_set_next_task() is skipped since
>          * rq->donor context is same. no balance
>          * callbacks are queued. Task A still on the
>          * push list.
>          */
>         rq->donor = B
>         rq->curr = B
>   =======> sched_out: Task A
>
>   !!! No balance callback; Task A still on push list. !!!
>
>   <======= sched_in: Task B

Hrm. I'm feeling like I'm a little lost here, specifically after
proxy_force_return(), since it doesn't exist yet at this point in the
patch series. But assuming we're at the "Handle blocked-waiter
migration" point in the series, I'd think it would be something like:

rq->donor= B
rq->curr = A
<< task A >>
mutex_unlock(Mutex0)
  mutex_handoff(B /* Task B */)
   preempt_disable()
     try_to_wake_up()
       resched_curr()
    preempt_enable()
      preempt_schedule()
        __schedule()
          find_proxy_task()
             proxy_force_return()
             return NULL
        pick_again:
          next = pick_next_task()
                      __pick_next_task() /* Returns B */
          rq->donor =B
          rq->curr = B
          context_switch()
<<switch to B >>
          finish_task_switch()
            finish_lock_switch()
              __balance_callbacks()

Your point "put_prev_set_next_task() is skipped since rq->donor
context is same" wasn't initially obvious to me, as the fair scheduler
does have a (p == prev) check, but it doesn't enqueue balance
callbacks.  And for RT/DL/SCX we should be using the pick_task()
method, which calls put_prev_set_next_task() in __pick_next_task().
But indeed, *inside* of put_prev_set_next_task() we return early if
(next == prev).

So I see your concern and agree.

> So what I'm getting to is, if we find that rq->donor has not changed
> with sched_proxy_exec() but rq->curr has changed during schedule(), we
> should forcefully do a:
>
>   prev->sched_class->put_prev_task(rq, rq->donor, rq->donor /* or rq->idle / NULL ? */);
>   next->sched_class->set_next_task(rq, rq->donor, true /* to queue balance callback. */);
>
> That way, when we do set_nex_task(), we see if we potentially have
> tasks in the push list and queue a balance callback since the
> task_on_cpu() condition may no longer apply to the tasks left behind
> on the list.
>
> Thoughts?

Yeah. I wonder if we can express this inside of
put_prev_set_next_task(). Reworking the shortcut to maybe:
    if (next == prev && next != rq->curr)

I probably need to think on this tomorrow, as I suspect the above has
some holes, but it seems like it would catch the cases that would
matter (maybe the issue is it catches too much - we'd probably also
trip it if A boosted B, and then we hit schedule and again chose A
to boost B, which we probably could have skipped).

I guess adding a new helper function to manually do the
put_prev/set_next could be added to the top level __schedule() logic
in the (prev != next) case, though we'll have to preserve the
prev_donor on the stack probably.
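
Just to sketch the shape I have in mind (the helper name here is made
up, purely for illustration):

    /* after find_proxy_task(), with the pre-pick donor saved as prev_donor */
    static void proxy_requeue_prev_pushable(struct rq *rq,
                                            struct task_struct *prev_donor)
    {
            struct task_struct *donor = rq->donor;

            /* donor unchanged, but rq->curr changed underneath it */
            if (donor != prev_donor)
                    return;

            /*
             * Redo the put/set so set_next_task() gets a chance to queue
             * a balance callback for tasks left on the pushable list.
             * (Probably also wants a check that rq->curr really changed,
             * per the "catches too much" worry above.)
             */
            donor->sched_class->put_prev_task(rq, donor, donor);
            donor->sched_class->set_next_task(rq, donor, true);
    }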

thanks
-john


* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-17  4:49     ` John Stultz
@ 2026-03-17  5:41       ` K Prateek Nayak
  2026-03-17  6:04         ` John Stultz
  2026-03-18 12:55         ` Peter Zijlstra
  0 siblings, 2 replies; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-17  5:41 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello John,

On 3/17/2026 10:19 AM, John Stultz wrote:
> On Sun, Mar 15, 2026 at 9:27 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Back to my concern with the queuing of the balance_callback, and the
>> deadline and RT folks can keep me honest here, consider the following:
>>
>>     CPU0
>>     ====
>>
>>   ======> Task A (prio: 80)
>>   ...
>>
>>   mutex_lock(Mutex0)
>>   ... /* Executing critical section. */
>>
>>     =====> Interrupt: Wakes up Task B (prio: 50); B->blocked_on = Mutex0;
>>       resched_curr()
>>     <===== Interrupt return
>>   preempt_schedule_irq()
>>     schedule()
>>       put_prev_set_next_task(A, B)
>>       rq->donor = B
>>       if (task_is_blocked(B))
>>         next = find_proxy_task() /* Return Task A */
>>       rq->curr = A
>>       queue_balance_callback()
>>     do_balance_callbacks()
>>       /* Finds A as task_on_cpu(); Does nothing. */
>>
>>   ... /* returns from schedule */
>>   ... /* continues with critical section */
>>
>>   mutex_unlock(Mutex0)
>>     mutex_handoff(B /* Task B */)
>>     preempt_disable()
>>       try_to_wake_up()
>>         resched_curr()
>>     preempt_enable()
>>       preempt_schedule()
>>         proxy_force_return()
>>           /* Returns to same CPU */
>>
>>         /*
>>          * put_prev_set_next_task() is skipped since
>>          * rq->donor context is same. no balance
>>          * callbacks are queued. Task A still on the
>>          * push list.
>>          */
>>         rq->donor = B
>>         rq->curr = B
>>   =======> sched_out: Task A
>>
>>   !!! No balance callback; Task A still on push list. !!!
>>
>>   <======= sched_in: Task B
> 
> Hrm. I'm feeling like I'm a little lost here, specifically after
> proxy_force_return(), since it doesn't exist yet at this point in the
> patch series.

Yeah I had to look a little bit ahead to poke holes here. Sorry about
that!

> But assuming we're at the "Handle blocked-waiter
> migration" point in the series, I'd think it would be something like:
> 
> rq->donor= B
> rq->curr = A
> << task A >>
> mutex_unlock(Mutex0)
>   mutex_handoff(B /* Task B */)
>    preempt_disable()
>      try_to_wake_up()
>        resched_curr()
>     preempt_enable()
>       preempt_schedule()
>         __schedule()
>           find_proxy_task()
>              proxy_force_return()
>              return NULL
>         pick_again:
>           next = pick_next_task()
>                       __pick_next_task() /* Returns B */
>           rq->donor =B
>           rq->curr = B
>           context_switch()
> <<switch to B >>
>           finish_task_switch()
>             finish_lock_switch()
>               __balance_callbacks()
> 
> Your point "put_prev_set_next_task() is skipped since rq->donor
> context is same" wasn't initially obvious to me, as the fair scheduler
> does have a (p == prev) check, but it doesn't enqueue balance
> callbacks.  And for RT/DL/SCX we should be using the pick_task()
> method, which calls put_prev_set_next_task() in __pick_next_task().
> But indeed, *inside* of put_prev_set_next_task() we return early if
> (next == prev).
> 
> So I see your concern and agree.
> 
>> So what I'm getting to is, if we find that rq->donor has not changed
>> with sched_proxy_exec() but rq->curr has changed during schedule(), we
>> should forcefully do a:
>>
>>   prev->sched_class->put_prev_task(rq, rq->donor, rq->donor /* or rq->idle / NULL ? */);
>>   next->sched_class->set_next_task(rq, rq->donor, true /* to queue balance callback. */);
>>
>> That way, when we do set_next_task(), we see if we potentially have
>> tasks in the push list and queue a balance callback since the
>> task_on_cpu() condition may no longer apply to the tasks left behind
>> on the list.
>>
>> Thoughts?
> 
> Yeah. I wonder if we can express this inside of
> put_prev_set_next_task(). Reworking the shortcut to maybe:
>     if (next == prev && next != rq->curr)
> 
> I probably need to think on this tomorrow, as I suspect the above has
> some holes, but it seems like it would catch the cases that would
> matter

Also this needs to be done after find_proxy_task(), since
"donor->blocked_on" needs to be cleared to queue callbacks; otherwise
we'll bail out on the task_is_blocked() check in set_next_task.*() with
PROXY_WAKING when it is done as part of pick_next_task().

> (maybe the issue is it catches too much - we'd probably also
> trip it if A boosted B, and then we hit schedule and again chose A
> to boost B, which we probably could have skipped).

Ack! Doing it when the execution context changes with the donor context
remaining the same would be optimal.

> 
> I guess adding a new helper function to manually do the
> put_prev/set_next could be added to the top level __schedule() logic
> in the (prev != next) case, though we'll have to preserve the
> prev_donor on the stack probably.

That seems like the best option to me too.

Also, deadline, RT, fair, and idle don't really care about the "next"
argument of put_prev_task() and the only one that does care is
put_prev_task_scx() to call switch_class() callback so putting it as
either NULL or "rq->donor" should be safe.

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-17  5:41       ` K Prateek Nayak
@ 2026-03-17  6:04         ` John Stultz
  2026-03-17  7:52           ` K Prateek Nayak
  2026-03-18 13:36           ` Peter Zijlstra
  2026-03-18 12:55         ` Peter Zijlstra
  1 sibling, 2 replies; 38+ messages in thread
From: John Stultz @ 2026-03-17  6:04 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/17/2026 10:19 AM, John Stultz wrote:
> >
> > I guess adding a new helper function to manually do the
> > put_prev/set_next could be added to the top level __schedule() logic
> > in the (prev != next) case, though we'll have to preserve the
> > prev_donor on the stack probably.
>
> That seems like the best option to me too.
>
> Also, deadline, RT, fair, and idle don't really care about the "next"
> argument of put_prev_task() and the only one that does care is
> put_prev_task_scx() to call switch_class() callback so putting it as
> either NULL or "rq->donor" should be safe.

Ack.
Here's the change I'm testing tonight (against 6.18):
https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6

Feel free to suggest a better name for the helper function. It feels a
little clunky (and sort of sad right after getting rid of the clunky
proxy_tag_curr(), to re-add something so similar).

Thanks again for your thoughtful feedback!
-john


* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-17  6:04         ` John Stultz
@ 2026-03-17  7:52           ` K Prateek Nayak
  2026-03-17 18:35             ` John Stultz
  2026-03-18 13:36           ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-17  7:52 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team


Hello John,

On 3/17/2026 11:34 AM, John Stultz wrote:
> On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 3/17/2026 10:19 AM, John Stultz wrote:
>>>
>>> I guess adding a new helper function to manually do the
>>> put_prev/set_next could be added to the top level __schedule() logic
>>> in the (prev != next) case, though we'll have to preserve the
>>> prev_donor on the stack probably.
>>
>> That seems like the best option to me too.
>>
>> Also, deadline, RT, fair, and idle don't really care about the "next"
>> argument of put_prev_task() and the only one that does care is
>> put_prev_task_scx() to call switch_class() callback so putting it as
>> either NULL or "rq->donor" should be safe.
> 
> Ack.
> Here's the change I'm testing tonight (against 6.18):
> https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6

Thanks a ton for pushing out to WIP! Only nit would be a:

  s/rq->donor/prev_donor/

on the lines with {put_prev/set_next}_task() to save on the
additional dereference since they are both the same (but maybe
the compiler figures that out on its own?)

Also would a sched_proxy_exec() check within that function
make sense to skip evaluation of that branch entirely when
proxy exec is disabled via cmdline?

> Feel free to suggest a better name for the helper function. It feels a
> little clunky (and sort of sad right after getting rid of the clunky
> proxy_tag_curr(), to re-add something so similar).

I know and I'm sorry about that (T_T)

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-17  7:52           ` K Prateek Nayak
@ 2026-03-17 18:35             ` John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-17 18:35 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Peter Zijlstra, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Tue, Mar 17, 2026 at 12:53 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
>
> Hello John,
>
> On 3/17/2026 11:34 AM, John Stultz wrote:
> > On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> >> On 3/17/2026 10:19 AM, John Stultz wrote:
> >>>
> >>> I guess adding a new helper function to manually do the
> >>> put_prev/set_next could be added to the top level __schedule() logic
> >>> in the (prev != next) case, though we'll have to preserve the
> >>> prev_donor on the stack probably.
> >>
> >> That seems like the best option to me too.
> >>
> >> Also, deadline, RT, fair, and idle don't really care about the "next"
> >> argument of put_prev_task() and the only one that does care is
> >> put_prev_task_scx() to call switch_class() callback so putting it as
> >> either NULL or "rq->donor" should be safe.
> >
> > Ack.
> > Here's the change I'm testing tonight (against 6.18):
> > https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6
>
> Thanks a ton for pushing out to WIP! Only nit would be a:
>
>   s/rq->donor/prev_donor/
>
> on the lines with {put_prev/set_next}_task() to save on the
> additional dereference since they are both the same (but maybe
> the compiler figures that out on its own?)

While the rq->donor and prev_donor are the same, I sort of preferred using:
  prev_donor->sched_class->put_prev_task(rq, prev_donor, rq->donor);
  rq->donor->sched_class->set_next_task(rq, rq->donor, true);

As it matches the familiar prev/next pattern used elsewhere (to me it
visually makes more sense and doesn't look as confusing).
I'd hope the compilers would sort out that they could save the dereference,
but maybe I can avoid the extra dereferences and still preserve the pattern
with a local "donor" variable? That should make sure it's simpler for
the compiler to optimize when they are equivalent.
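
Something like this, just to show the shape:

    struct task_struct *donor = rq->donor;

    prev_donor->sched_class->put_prev_task(rq, prev_donor, donor);
    donor->sched_class->set_next_task(rq, donor, true);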

> Also would a sched_proxy_exec() check within that function
> make sense to skip evaluation of that branch entirely when
> proxy exec is disabled via cmdline?

So after Peter's feedback that the sched_proxy_exec() checks weren't
optimizing out the way I'd hoped, I'm hesitant to add another check
there. Seems the donor == prev_donor check would amount to the same
amount of work?

> > Feel free to suggest a better name for the helper function. It feels a
> > little clunky (and sort of sad right after getting rid of the clunky
> > proxy_tag_curr(), to re-add something so similar).
>
> I know and I'm sorry about that (T_T)

Oh, no worries! Again, I really appreciate you catching these subtle issues!
thanks
-john


* Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
  2026-03-13  2:30 ` [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
  2026-03-15 17:38   ` K Prateek Nayak
@ 2026-03-18  6:35   ` Juri Lelli
  2026-03-18  6:56     ` K Prateek Nayak
  2026-03-18 12:59   ` Peter Zijlstra
  2026-03-19 12:49   ` Peter Zijlstra
  3 siblings, 1 reply; 38+ messages in thread
From: Juri Lelli @ 2026-03-18  6:35 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello,

I couldn't convince myself the below is not potentially racy ...

On 13/03/26 02:30, John Stultz wrote:

...

> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p, int target_cpu)
>  {
> -	if (!__proxy_deactivate(rq, donor)) {
> -		/*
> -		 * XXX: For now, if deactivation failed, set donor
> -		 * as unblocked, as we aren't doing proxy-migrations
> -		 * yet (more logic will be needed then).
> -		 */
> -		clear_task_blocked_on(donor, NULL);
> +	struct rq *target_rq = cpu_rq(target_cpu);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
> +	 * otherwise we have a reference that no longer belongs to us.
> +	 *
> +	 * Additionally, as we put_prev_task(prev) earlier, its possible that
> +	 * prev will migrate away as soon as we drop the rq lock, however we
> +	 * still have it marked as rq->curr, as we've not yet switched tasks.
> +	 *
> +	 * So call proxy_resched_idle() to let go of the references before
> +	 * we release the lock.
> +	 */
> +	proxy_resched_idle(rq);
> +
> +	WARN_ON(p == rq->curr);
> +
> +	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> +	proxy_set_task_cpu(p, target_cpu);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */
> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	attach_one_task(target_rq, p);

We release the rq lock between deactivate and attach (and we hold
neither wait_lock nor blocked_lock as they are out of scope at this
point). Can't something like the following happen?

  - Task A: blocked on mutex M, queued on CPU 0
  - Task B: owns mutex M, running on CPU 1

  CPU 0 (migrating A→CPU 1)        CPU 1 (B finishes critical section)
  -------------------------        ------------------------------------
  find_proxy_task(donor=A):
    owner = B, owner_cpu = 1
    action = MIGRATE
    // guard releases wait_lock

  proxy_migrate_task(A, cpu=1):
    deactivate_task(rq0, A)
      → A->on_rq = 0
    proxy_set_task_cpu(A, 1)
      → A->cpu = 1
    raw_spin_rq_unlock(rq0)
      → RQ0 LOCK RELEASED
                                   // Task B running
                                   mutex_unlock(M):
                                     lock(&M->wait_lock)   // ← Can grab it
                                     A->blocked_on = PROXY_WAKING
                                     unlock(&M->wait_lock)
                                     wake_up_q():
                                       try_to_wake_up(A):
                                         sees A->on_rq == 0
                                         cpu = select_task_rq(A)
                                           → returns CPU 2
                                         set_task_cpu(A, 2)
                                         ttwu_queue(A, 2)
                                           → A enqueued on CPU 2
                                           → A->on_rq = 1, A->cpu = 2

  attach_one_task(rq1, A):
    attach_task(rq1, A):
      WARN_ON_ONCE(task_rq(A) != rq1)
        → Fires! task_rq(A) = rq2
      activate_task(rq1, A)
        → Double-enqueue! A->on_rq already = 1

What am I missing? :)

Thanks,
Juri



* Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
  2026-03-18  6:35   ` Juri Lelli
@ 2026-03-18  6:56     ` K Prateek Nayak
  2026-03-18 10:16       ` Juri Lelli
  0 siblings, 1 reply; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-18  6:56 UTC (permalink / raw)
  To: Juri Lelli, John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello Juri,

On 3/18/2026 12:05 PM, Juri Lelli wrote:
>> +	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
>> +	proxy_set_task_cpu(p, target_cpu);
>> +
>> +	/*
>> +	 * We have to zap callbacks before unlocking the rq
>> +	 * as another CPU may jump in and call sched_balance_rq
>> +	 * which can trip the warning in rq_pin_lock() if we
>> +	 * leave callbacks set.
>> +	 */
>> +	zap_balance_callbacks(rq);
>> +	rq_unpin_lock(rq, rf);
>> +	raw_spin_rq_unlock(rq);
>> +
>> +	attach_one_task(target_rq, p);
> 
> We release the rq lock between deactivate and attach (and we hold
> neither wait_lock nor blocked_lock as they are out of scope at this
> point). Can't something like the following happen?
> 
>   - Task A: blocked on mutex M, queued on CPU 0
>   - Task B: owns mutex M, running on CPU 1
> 
>   CPU 0 (migrating A→CPU 1)        CPU 1 (B finishes critical section)
>   -------------------------        ------------------------------------
>   find_proxy_task(donor=A):
>     owner = B, owner_cpu = 1
>     action = MIGRATE
>     // guard releases wait_lock
> 
>   proxy_migrate_task(A, cpu=1):
>     deactivate_task(rq0, A)
>       → A->on_rq = 0

      This sets TASK_ON_RQ_MIGRATING
      before dequeuing.

      block_task() is the only one
      that clears task->on_rq now.

>     proxy_set_task_cpu(A, 1)
>       → A->cpu = 1
>     raw_spin_rq_unlock(rq0)
>       → RQ0 LOCK RELEASED
>                                    // Task B running
>                                    mutex_unlock(M):
>                                      lock(&M->wait_lock)   // ← Can grab it
>                                      A->blocked_on = PROXY_WAKING
>                                      unlock(&M->wait_lock)
>                                      wake_up_q():
>                                        try_to_wake_up(A):

                                     CPU1 sees p->on_rq (TASK_ON_RQ_MIGRATING)
                                     and goes into ttwu_runnable() and stalls
                                     at __task_rq_lock() since it sees
                                     task_on_rq_migrating() ...

    attach is done here
    A->on_rq is set to
    TASK_ON_RQ_QUEUED

                                     ... we come back here, see
                                     task_on_rq_queued() and simply do a
                                     wakeup_preempt() and bail out early
                                     from the try_to_wake_up() path.

>                                          sees A->on_rq == 0
>                                          cpu = select_task_rq(A)
>                                            → returns CPU 2
>                                          set_task_cpu(A, 2)
>                                          ttwu_queue(A, 2)
>                                            → A enqueued on CPU 2
>                                            → A->on_rq = 1, A->cpu = 2
> 
>   attach_one_task(rq1, A):
>     attach_task(rq1, A):
>       WARN_ON_ONCE(task_rq(A) != rq1)
>         → Fires! task_rq(A) = rq2
>       activate_task(rq1, A)
>         → Double-enqueue! A->on_rq already = 1

Thus, we avoid that unless I'm mistaken :-)
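
For reference, the wait I'm describing is the retry loop in
__task_rq_lock(), which is roughly (simplified from memory):

    for (;;) {
            rq = task_rq(p);
            raw_spin_rq_lock(rq);
            if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
                    return rq;      /* state is stable again */
            raw_spin_rq_unlock(rq);

            /* spin until the in-flight migration (our attach) finishes */
            while (unlikely(task_on_rq_migrating(p)))
                    cpu_relax();
    }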

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
  2026-03-18  6:56     ` K Prateek Nayak
@ 2026-03-18 10:16       ` Juri Lelli
  0 siblings, 0 replies; 38+ messages in thread
From: Juri Lelli @ 2026-03-18 10:16 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
	Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
	Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On 18/03/26 12:26, K Prateek Nayak wrote:
> Hello Juri,
> 
> On 3/18/2026 12:05 PM, Juri Lelli wrote:
> >> +	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> >> +	proxy_set_task_cpu(p, target_cpu);
> >> +
> >> +	/*
> >> +	 * We have to zap callbacks before unlocking the rq
> >> +	 * as another CPU may jump in and call sched_balance_rq
> >> +	 * which can trip the warning in rq_pin_lock() if we
> >> +	 * leave callbacks set.
> >> +	 */
> >> +	zap_balance_callbacks(rq);
> >> +	rq_unpin_lock(rq, rf);
> >> +	raw_spin_rq_unlock(rq);
> >> +
> >> +	attach_one_task(target_rq, p);
> > 
> > We release the rq lock between deactivate and attach (and we hold
> > neither wait_lock nor blocked_lock as they are out of scope at this
> > point). Can't something like the following happen?
> > 
> >   - Task A: blocked on mutex M, queued on CPU 0
> >   - Task B: owns mutex M, running on CPU 1
> > 
> >   CPU 0 (migrating A→CPU 1)        CPU 1 (B finishes critical section)
> >   -------------------------        ------------------------------------
> >   find_proxy_task(donor=A):
> >     owner = B, owner_cpu = 1
> >     action = MIGRATE
> >     // guard releases wait_lock
> > 
> >   proxy_migrate_task(A, cpu=1):
> >     deactivate_task(rq0, A)
> >       → A->on_rq = 0
> 
>       This sets TASK_ON_RQ_MIGRATING
>       before dequeuing.

Right you are, I missed this!

Sorry for the noise and thanks for the quick reply.

Best,
Juri



* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-17  5:41       ` K Prateek Nayak
  2026-03-17  6:04         ` John Stultz
@ 2026-03-18 12:55         ` Peter Zijlstra
  2026-03-18 18:01           ` K Prateek Nayak
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2026-03-18 12:55 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Tue, Mar 17, 2026 at 11:11:20AM +0530, K Prateek Nayak wrote:

> Also, deadline, RT, fair, and idle don't really care about the "next"
> argument of put_prev_task() and the only one that does care is
> put_prev_task_scx() to call switch_class() callback so putting it as
> either NULL or "rq->donor" should be safe.

https://lkml.kernel.org/r/20260317104343.225156112@infradead.org

Makes fair care about the @next argument to put_prev_task_fair().


* Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
  2026-03-13  2:30 ` [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
  2026-03-15 17:38   ` K Prateek Nayak
  2026-03-18  6:35   ` Juri Lelli
@ 2026-03-18 12:59   ` Peter Zijlstra
  2026-03-19 12:49   ` Peter Zijlstra
  3 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2026-03-18 12:59 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Fri, Mar 13, 2026 at 02:30:10AM +0000, John Stultz wrote:
> Add logic to handle migrating a blocked waiter to a remote
> cpu where the lock owner is runnable.
> 
> Additionally, as the blocked task may not be able to run
> on the remote cpu, add logic to handle return migration once
> the waiting task is given the mutex.
> 
> Because tasks may get migrated to where they cannot run, also
> modify the scheduling classes to avoid sched class migrations on
> mutex blocked tasks, leaving find_proxy_task() and related logic
> to do the migrations and return migrations.
> 
> This was split out from the larger proxy patch, and
> significantly reworked.
> 
> Credits for the original patch go to:
>   Peter Zijlstra (Intel) <peterz@infradead.org>
>   Juri Lelli <juri.lelli@redhat.com>
>   Valentin Schneider <valentin.schneider@arm.com>
>   Connor O'Brien <connoro@google.com>
> 
> Signed-off-by: John Stultz <jstultz@google.com>

This patch wants the below.. Otherwise clang-22+ builds will be sad.

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6676,6 +6676,7 @@ static bool proxy_deactivate(struct rq *
  */
 static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p, int target_cpu)
+	__must_hold(__rq_lockp(rq))
 {
 	struct rq *target_rq = cpu_rq(target_cpu);
 
@@ -6718,6 +6719,7 @@ static void proxy_migrate_task(struct rq
 
 static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p)
+	__must_hold(__rq_lockp(rq))
 {
 	struct rq *this_rq, *target_rq;
 	struct rq_flags this_rf;
@@ -6811,6 +6813,7 @@ static void proxy_force_return(struct rq
  */
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+	__must_hold(__rq_lockp(rq))
 {
 	enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
 	struct task_struct *owner = NULL;


* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-17  6:04         ` John Stultz
  2026-03-17  7:52           ` K Prateek Nayak
@ 2026-03-18 13:36           ` Peter Zijlstra
  2026-03-18 13:52             ` Peter Zijlstra
  2026-03-18 20:30             ` John Stultz
  1 sibling, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2026-03-18 13:36 UTC (permalink / raw)
  To: John Stultz
  Cc: K Prateek Nayak, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Mon, Mar 16, 2026 at 11:04:28PM -0700, John Stultz wrote:
> On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > On 3/17/2026 10:19 AM, John Stultz wrote:
> > >
> > > I guess adding a new helper function to manually do the
> > > put_prev/set_next could be added to the top level __schedule() logic
> > > in the (prev != next) case, though we'll have to preserve the
> > > prev_donor on the stack probably.
> >
> > That seems like the best option to me too.
> >
> > Also, deadline, RT, fair, and idle don't really care about the "next"
> > argument of put_prev_task() and the only one that does care is
> > put_prev_task_scx() to call switch_class() callback so putting it as
> > either NULL or "rq->donor" should be safe.
> 
> Ack.
> Here's the change I'm testing tonight (against 6.18):
> https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6
> 
> Feel free to suggest a better name for the helper function. It feels a
> little clunky (and sort of sad right after getting rid of the clunky
> proxy_tag_curr(), to re-add something so similar).

Does this capture it?

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7100,9 +7103,11 @@ static void __sched notrace __schedule(i
 pick_again:
 	assert_balance_callbacks_empty(rq);
 	next = pick_next_task(rq, rq->donor, &rf);
-	rq_set_donor(rq, next);
 	rq->next_class = next->sched_class;
 	if (sched_proxy_exec()) {
+		struct task_struct *prev_donor = rq->donor;
+
+		rq_set_donor(rq, next);
 		if (unlikely(next->blocked_on)) {
 			next = find_proxy_task(rq, next, &rf);
 			if (!next) {
@@ -7114,6 +7119,24 @@ static void __sched notrace __schedule(i
 				goto keep_resched;
 			}
 		}
+
+		/*
+		 * When transitioning like:
+		 *
+		 *	   prev		next
+		 * donor:    B		  B
+		 * curr:     A		  B
+		 *
+		 * then put_prev_set_next_task() will not have done anything,
+		 * since B == B. However, A might have missed a RT/DL balance
+		 * opportunity due to being on_cpu.
+		 */
+		if (next == rq->donor && next == prev_donor) {
+			next->sched_class->put_prev_task(rq, next, next);
+			next->sched_class->set_next_task(rq, next, true);
+		}
+	} else {
+		rq_set_donor(rq, next);
 	}
 picked:
 	clear_tsk_need_resched(prev);


* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-18 13:36           ` Peter Zijlstra
@ 2026-03-18 13:52             ` Peter Zijlstra
  2026-03-18 17:55               ` K Prateek Nayak
  2026-03-18 20:30             ` John Stultz
  1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2026-03-18 13:52 UTC (permalink / raw)
  To: John Stultz
  Cc: K Prateek Nayak, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Mar 18, 2026 at 02:36:40PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 16, 2026 at 11:04:28PM -0700, John Stultz wrote:
> > On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > > On 3/17/2026 10:19 AM, John Stultz wrote:
> > > >
> > > > I guess adding a new helper function to manually do the
> > > > put_prev/set_next could be added to the top level __schedule() logic
> > > > in the (prev != next) case, though we'll have to preserve the
> > > > prev_donor on the stack probably.
> > >
> > > That seems like the best option to me too.
> > >
> > > Also, deadline, RT, fair, and idle don't really care about the "next"
> > > argument of put_prev_task() and the only one that does care is
> > > put_prev_task_scx() to call switch_class() callback so putting it as
> > > either NULL or "rq->donor" should be safe.
> > 
> > Ack.
> > Here's the change I'm testing tonight (against 6.18):
> > https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6
> > 
> > Feel free to suggest a better name for the helper function. It feels a
> > little clunky (and sort of sad right after getting rid of the clunky
> > proxy_tag_curr(), to re-add something so similar).
> 
> Does this capture it?
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7100,9 +7103,11 @@ static void __sched notrace __schedule(i
>  pick_again:
>  	assert_balance_callbacks_empty(rq);
>  	next = pick_next_task(rq, rq->donor, &rf);
> -	rq_set_donor(rq, next);
>  	rq->next_class = next->sched_class;
>  	if (sched_proxy_exec()) {
> +		struct task_struct *prev_donor = rq->donor;
> +
> +		rq_set_donor(rq, next);
>  		if (unlikely(next->blocked_on)) {
>  			next = find_proxy_task(rq, next, &rf);
>  			if (!next) {
> @@ -7114,6 +7119,24 @@ static void __sched notrace __schedule(i
>  				goto keep_resched;
>  			}
>  		}
> +
> +		/*
> +		 * When transitioning like:
> +		 *
> +		 *	   prev		next
> +		 * donor:    B		  B
> +		 * curr:     A		  B
> +		 *
> +		 * then put_prev_set_next_task() will not have done anything,
> +		 * since B == B. However, A might have missed a RT/DL balance
> +		 * opportunity due to being on_cpu.
> +		 */
> +		if (next == rq->donor && next == prev_donor) {

	&& next != prev

> +			next->sched_class->put_prev_task(rq, next, next);
> +			next->sched_class->set_next_task(rq, next, true);
> +		}
> +	} else {
> +		rq_set_donor(rq, next);
>  	}
>  picked:
>  	clear_tsk_need_resched(prev);


* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-18 13:52             ` Peter Zijlstra
@ 2026-03-18 17:55               ` K Prateek Nayak
  0 siblings, 0 replies; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-18 17:55 UTC (permalink / raw)
  To: Peter Zijlstra, John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

Hello Peter,

On 3/18/2026 7:22 PM, Peter Zijlstra wrote:
>> @@ -7114,6 +7119,24 @@ static void __sched notrace __schedule(i
>>  				goto keep_resched;
>>  			}
>>  		}
>> +
>> +		/*
>> +		 * When transitioning like:
>> +		 *
>> +		 *	   prev		next
>> +		 * donor:    B		  B
>> +		 * curr:     A		  B

Even for curr going A -> C, we should do this.

Consider B is blocked on C which is in turn blocked on A. B gets picked
and proxies A first as a result of the wait chain (B -> C -> A)

A runs and does an unlock with a handoff to C clearing its blocked_on.
In schedule B is picked again but this time it proxies C.

A might be on the push list and was skipped last time around since
A->on_cpu = 1 but now that is gone and we should still do put_prev_task()
+ set_next_task().

>> +		 *
>> +		 * then put_prev_set_next_task() will not have done anything,
>> +		 * since B == B. However, A might have missed a RT/DL balance
>> +		 * opportunity due to being on_cpu.
>> +		 */
>> +		if (next == rq->donor && next == prev_donor) {
> 
> 	&& next != prev
> 
>> +			next->sched_class->put_prev_task(rq, next, next);
>> +			next->sched_class->set_next_task(rq, next, true);
>> +		}
>> +	} else {
>> +		rq_set_donor(rq, next);
>>  	}
>>  picked:
>>  	clear_tsk_need_resched(prev);

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-18 12:55         ` Peter Zijlstra
@ 2026-03-18 18:01           ` K Prateek Nayak
  0 siblings, 0 replies; 38+ messages in thread
From: K Prateek Nayak @ 2026-03-18 18:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On 3/18/2026 6:25 PM, Peter Zijlstra wrote:
> On Tue, Mar 17, 2026 at 11:11:20AM +0530, K Prateek Nayak wrote:
> 
>> Also, deadline, RT, fair, and idle don't really care about the "next"
>> argument of put_prev_task() and the only one that does care is
>> put_prev_task_scx() to call switch_class() callback so putting it as
>> either NULL or "rq->donor" should be safe.
> 
> https://lkml.kernel.org/r/20260317104343.225156112@infradead.org
> 
> Makes fair care about the @next argument to put_prev_task_fair().

Ack! In this case, using "rq->donor" is better since we skip traversing
the entire cgroup hierarchy and bail out at the common parent, which is
the immediate next entity.

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
  2026-03-15 17:38   ` K Prateek Nayak
@ 2026-03-18 19:07     ` John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-18 19:07 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Sun, Mar 15, 2026 at 10:38 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/13/2026 8:00 AM, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index af497b8c72dce..fe20204cf51cc 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
> >       rq->idle_stamp = 0;
> >  }
> >
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> > +{
> > +     unsigned int wake_cpu;
> > +
> > +     /*
> > +      * Since we are enqueuing a blocked task on a cpu it may
> > +      * not be able to run on, preserve wake_cpu when we
> > +      * __set_task_cpu so we can return the task to where it
> > +      * was previously runnable.
> > +      */
> > +     wake_cpu = p->wake_cpu;
> > +     __set_task_cpu(p, cpu);
> > +     p->wake_cpu = wake_cpu;
> > +}
> > +#endif /* CONFIG_SCHED_PROXY_EXEC */
> > +
> >  static void
> >  ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> >                struct rq_flags *rf)
> > @@ -4242,13 +4259,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >               ttwu_queue(p, cpu, wake_flags);
> >       }
> >  out:
> > -     /*
> > -      * For now, if we've been woken up, clear the task->blocked_on
> > -      * regardless if it was set to a mutex or PROXY_WAKING so the
> > -      * task can run. We will need to be more careful later when
> > -      * properly handling proxy migration
> > -      */
> > -     clear_task_blocked_on(p, NULL);
>
> So, for this bit, there are mutex variants that are interruptible and
> killable which probably benefits from clearing the blocked_on
> relation.

This is a good point! I need to re-review some of this with that in mind.

> For potential proxy tasks that are still queued, we'll hit the
> ttwu_runnable() path and resched out of there so it makes sense to
> mark them as PROXY_WAKING so schedule() can return migrate them, they
> run and  hit the signal_pending_state() check in __mutex_lock_common()
> loop, and return -EINTR.
>
> Otherwise, if they need a full wakeup, they may be blocked on a
> sleeping owner, in which case it is beneficial to clear blocked_on, do
> a full wakeup. and let them run to evaluate the pending signal.
>
> ttwu_state_match() should filter out any spurious signals. Thoughts?

So, I don't think we can keep clear_task_blocked_on(p, NULL) in the
out: path there, as then any wakeup would allow the task to run on
that runqueue, even if it was not smp affined.

But if we did go through the select_task_rq() logic, then clearing the
blocked_on bit should be safe.  However if blocked_on is set, the task
is likely to be on the rq, so most cases will shortcut at
ttwu_runnable(), so we probably wouldn't get there.

So maybe if I understand your suggestion, we should
clear_task_blocked_on() if we select_task_rq(), and otherwise in the
error path set any blocked_on value to PROXY_WAKING?

I guess this could also move the set_task_blocked_on_waking into ttwu
instead of the lock waker logic. I'll play with that.

> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p)
> > +{
> > +     struct rq *this_rq, *target_rq;
> > +     struct rq_flags this_rf;
> > +     int cpu, wake_flag = WF_TTWU;
> > +
> > +     lockdep_assert_rq_held(rq);
> > +     WARN_ON(p == rq->curr);
> > +
> > +     /*
> > +      * We have to zap callbacks before unlocking the rq
> > +      * as another CPU may jump in and call sched_balance_rq
> > +      * which can trip the warning in rq_pin_lock() if we
> > +      * leave callbacks set.
> > +      */
> > +     zap_balance_callbacks(rq);
> > +     rq_unpin_lock(rq, rf);
> > +     raw_spin_rq_unlock(rq);
> > +
> > +     /*
> > +      * We drop the rq lock, and re-grab task_rq_lock to get
> > +      * the pi_lock (needed for select_task_rq) as well.
> > +      */
> > +     this_rq = task_rq_lock(p, &this_rf);
> > +
> > +     /*
> > +      * Since we let go of the rq lock, the task may have been
> > +      * woken or migrated to another rq before we  got the
> > +      * task_rq_lock. So re-check we're on the same RQ. If
> > +	 * will handle any further migrations.
> > +      * will handle any futher migrations.
> > +      */
> > +     if (this_rq != rq)
> > +             goto err_out;
> > +
> > +     /* Similarly, if we've been dequeued, someone else will wake us */
> > +     if (!task_on_rq_queued(p))
> > +             goto err_out;
> > +
> > +     /*
> > +      * Since we should only be calling here from __schedule()
> > +      * -> find_proxy_task(), no one else should have
> > +      * assigned current out from under us. But check and warn
> > +      * if we see this, then bail.
> > +      */
> > +     if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> > +             WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
> > +                       __func__, cpu_of(this_rq),
> > +                       p->comm, p->pid, p->on_cpu);
> > +             goto err_out;
> >       }
> > -     return NULL;
> > +
> > +     update_rq_clock(this_rq);
> > +     proxy_resched_idle(this_rq);
>
> I still think this is too late, and only required if we are moving the
> donor. Can we do this before we drop the rq_lock so that a remote
> wakeup doesn't need to clear this? (although I think we don't have

Sorry I'm not sure I'm following this bit. Are you suggesting the
update_rq_clock goes above the error handling? Or are you suggesting I
move proxy_resched_idle() elsewhere?

> that bit in the ttwu path anymore and we rely on the schedule() bits
> completely for return migration on this version - any particular
> reason?).

Yes, Peter wanted the return-migration via ttwu to be in a separate patch:
https://lore.kernel.org/lkml/20251009114302.GI3245006@noisy.programming.kicks-ass.net/


>
> > +     deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
> > +     cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> > +     set_task_cpu(p, cpu);
> > +     target_rq = cpu_rq(cpu);
> > +     clear_task_blocked_on(p, NULL);
> > +     task_rq_unlock(this_rq, p, &this_rf);
> > +
> > +     attach_one_task(target_rq, p);
>
> I'm still having a hard time believing we cannot use wake_up_process()
> but let me look more into that tomorrow when the sun rises.

I'm curious to hear if you had much luck on this. I've tinkered a bit
today, but keep on hitting the same issue:

<<<Task A>>>
__mutex_unlock_slowpath(lock);
  set_task_blocked_on_waking(task_B, lock);
  wake_up_process(task_B); /* via wake_up_q() */
    try_to_wake_up(task_B, TASK_NORMAL, 0);
      ttwu_runnable(task_B, WF_TTWU);  /* donor is on_rq, so we trip into this */
        ttwu_do_wakeup(task_B);
          WRITE_ONCE(p->__state, TASK_RUNNING);
  preempt_schedule_irq()
    __schedule()
       next = pick_next_task(); /* returns task_B  (still PROXY_WAKING) */
       find_proxy_task(rq, task_B, &rf)
         proxy_force_return(rq, rf, task_B);

At this point conceptually we want to dequeue task_B from the
runqueue, and call wake_up_process() so it will be return-migrated to
a runqueue it can run on.

However, the task state is already TASK_RUNNING now, so calling
wake_up_process() again will just shortcut out at ttwu_state_match().
Transitioning to TASK_INTERRUPTIBLE or something else before calling
wake_up_process() seems risky to me (but let me know if I'm wrong here).
So to me, doing the manual deactivate/select_task_rq/attach_one_task
work in proxy_force_return() seems the most straightforward, even
though it is a little duplicative of the ttwu logic.
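
The shortcut I mean is roughly this (simplified; the real check lives
in ttwu_state_match()):

    /* wake_up_process(p) is try_to_wake_up(p, TASK_NORMAL, 0) */
    if (!(READ_ONCE(p->__state) & TASK_NORMAL))
            goto out;       /* TASK_RUNNING is 0, so nothing matches */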

I think when I had something similar before, it was leaning on
modifications to ttwu(), which this patch avoids at Peter's request.
Though maybe this logic can be simplified with the later optimization
patch to do return migration in the ttwu path?

thanks
-john


* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-18 13:36           ` Peter Zijlstra
  2026-03-18 13:52             ` Peter Zijlstra
@ 2026-03-18 20:30             ` John Stultz
  2026-03-18 20:34               ` Peter Zijlstra
  1 sibling, 1 reply; 38+ messages in thread
From: John Stultz @ 2026-03-18 20:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Mar 18, 2026 at 6:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Mar 16, 2026 at 11:04:28PM -0700, John Stultz wrote:
> > On Mon, Mar 16, 2026 at 10:41 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> > > On 3/17/2026 10:19 AM, John Stultz wrote:
> > > >
> > > > I guess adding a new helper function to manually do the
> > > > put_prev/set_next could be added to the top level __schedule() logic
> > > > in the (prev != next) case, though we'll have to preserve the
> > > > prev_donor on the stack probably.
> > >
> > > That seems like the best option to me too.
> > >
> > > Also, deadline, RT, fair, and idle don't really care about the "next"
> > > argument of put_prev_task() and the only one that does care is
> > > put_prev_task_scx() to call switch_class() callback so putting it as
> > > either NULL or "rq->donor" should be safe.
> >
> > Ack.
> > Here's the change I'm testing tonight (against 6.18):
> > https://github.com/johnstultz-work/linux-dev/commit/0cc72a4923143f496e33711cbcc1afdf6d861ca6
> >
> > Feel free to suggest a better name for the helper function. It feels a
> > little clunky (and sort of sad right after getting rid of the clunky
> > proxy_tag_curr(), to re-add something so similar).
>
> Does this capture it?

Yeah, this looks equivalent to what I had in my patch linked above.
I'll go ahead and take your version though.

The only tweak I might consider is setting the prev_donor along with
prev, so we don't have to move the rq_set_donor() call to be on both
sides of the sched_proxy_exec() conditional.

Feels much simpler to read, and I'd hope the compiler would optimize
the extra assignment away if it were unused at compile time. But maybe
there is a concern you have I'm missing?

thanks
-john


* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-18 20:30             ` John Stultz
@ 2026-03-18 20:34               ` Peter Zijlstra
  2026-03-18 20:35                 ` John Stultz
  0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2026-03-18 20:34 UTC (permalink / raw)
  To: John Stultz
  Cc: K Prateek Nayak, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Mar 18, 2026 at 01:30:17PM -0700, John Stultz wrote:

> Yeah, this looks equivalent to what I had in my patch linked above.
> I'll go ahead and take your version though.
> 
> The only tweak I might consider is setting the prev_donor along with
> prev, so we don't have to move the rq_set_donor() call to be on both
> sides of the sched_proxy_exec() conditional.
> 
> Feels much simpler to read, and I'd hope the compiler would optimize
> the extra assignment away if it were unused at compile time. But maybe
> there is a concern you have I'm missing?

So my goal was to capture as much of the PE-specific stuff as possible
inside that sched_proxy_exec() branch.

Having to push that rq_set_donor() into the else branch was a little
unfortunate, but keeping it all inside that branch means we can
easily break it out into its own function.


* Re: [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
  2026-03-18 20:34               ` Peter Zijlstra
@ 2026-03-18 20:35                 ` John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-18 20:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: K Prateek Nayak, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
	Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Wed, Mar 18, 2026 at 1:34 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 18, 2026 at 01:30:17PM -0700, John Stultz wrote:
>
> > Yeah, this looks equivalent to what I had in my patch linked above.
> > I'll go ahead and take your version though.
> >
> > The only tweak I might consider is setting the prev_donor along with
> > prev, so we don't have to move the rq_set_donor() call to be on both
> > sides of the sched_proxy_exec() conditional.
> >
> > Feels much simpler to read, and I'd hope the compiler would optimize
> > the extra assignment away if it were unused at compile time. But maybe
> > there is a concern you have I'm missing?
>
> So my goal was to capture as much of the PE-specific stuff as possible
> inside that sched_proxy_exec() branch.
>
> Having to push that rq_set_donor() into the else branch was a little
> unfortunate, but keeping it all inside that branch means we can
> easily break it out into its own function.

Ok. Sounds good. I'll match your suggestion then.

Appreciate the quick feedback!
-john


* Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
  2026-03-13  2:30 ` [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
                     ` (2 preceding siblings ...)
  2026-03-18 12:59   ` Peter Zijlstra
@ 2026-03-19 12:49   ` Peter Zijlstra
  2026-03-19 21:26     ` John Stultz
  3 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2026-03-19 12:49 UTC (permalink / raw)
  To: John Stultz
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Fri, Mar 13, 2026 at 02:30:10AM +0000, John Stultz wrote:
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p, int target_cpu)
>  {
> +	struct rq *target_rq = cpu_rq(target_cpu);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * Since we're going to drop @rq, we have to put(@rq->donor) first,
> +	 * otherwise we have a reference that no longer belongs to us.
> +	 *
> +	 * Additionally, as we put_prev_task(prev) earlier, it's possible that
> +	 * prev will migrate away as soon as we drop the rq lock, however we
> +	 * still have it marked as rq->curr, as we've not yet switched tasks.
> +	 *
> +	 * So call proxy_resched_idle() to let go of the references before
> +	 * we release the lock.
> +	 */
> +	proxy_resched_idle(rq);

This comment confuses the heck out of me. It seems to imply we need to
schedule before dropping rq->lock.

> +
> +	WARN_ON(p == rq->curr);
> +
> +	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> +	proxy_set_task_cpu(p, target_cpu);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */

It might be good to explain where these callbacks come from.

> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);
> +
> +	attach_one_task(target_rq, p);
> +
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	update_rq_clock(rq);
> +}
> +
> +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> +			       struct task_struct *p)
> +{
> +	struct rq *this_rq, *target_rq;
> +	struct rq_flags this_rf;
> +	int cpu, wake_flag = WF_TTWU;
> +
> +	lockdep_assert_rq_held(rq);
> +	WARN_ON(p == rq->curr);
> +
> +	/*
> +	 * We have to zap callbacks before unlocking the rq
> +	 * as another CPU may jump in and call sched_balance_rq
> +	 * which can trip the warning in rq_pin_lock() if we
> +	 * leave callbacks set.
> +	 */

idem

> +	zap_balance_callbacks(rq);
> +	rq_unpin_lock(rq, rf);
> +	raw_spin_rq_unlock(rq);

This is in fact the very same sequence as above.

> +
> +	/*
> +	 * We drop the rq lock, and re-grab task_rq_lock to get
> +	 * the pi_lock (needed for select_task_rq) as well.
> +	 */
> +	this_rq = task_rq_lock(p, &this_rf);
> +
> +	/*
> +	 * Since we let go of the rq lock, the task may have been
> +	 * woken or migrated to another rq before we  got the
> +	 * task_rq_lock. So re-check we're on the same RQ. If
> +	 * not, the task has already been migrated and that CPU
> +	 * will handle any further migrations.
> +	 */
> +	if (this_rq != rq)
> +		goto err_out;
> +
> +	/* Similarly, if we've been dequeued, someone else will wake us */
> +	if (!task_on_rq_queued(p))
> +		goto err_out;
> +
> +	/*
> +	 * Since we should only be calling here from __schedule()
> +	 * -> find_proxy_task(), no one else should have
> +	 * assigned current out from under us. But check and warn
> +	 * if we see this, then bail.
> +	 */
> +	if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
> +		WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
> +			  __func__, cpu_of(this_rq),
> +			  p->comm, p->pid, p->on_cpu);
> +		goto err_out;
>  	}
> -	return NULL;
> +
> +	update_rq_clock(this_rq);
> +	proxy_resched_idle(this_rq);
> +	deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
> +	cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> +	set_task_cpu(p, cpu);
> +	target_rq = cpu_rq(cpu);
> +	clear_task_blocked_on(p, NULL);
> +	task_rq_unlock(this_rq, p, &this_rf);
> +
> +	attach_one_task(target_rq, p);
> +
> +	/* Finally, re-grab the original rq lock and return to pick-again */
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	update_rq_clock(rq);
> +	return;
> +
> +err_out:
> +	task_rq_unlock(this_rq, p, &this_rf);
> +	raw_spin_rq_lock(rq);
> +	rq_repin_lock(rq, rf);
> +	update_rq_clock(rq);
>  }

Hurm... how about something like so?

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6662,6 +6662,28 @@ static bool proxy_deactivate(struct rq *
 	return try_to_block_task(rq, donor, &state, true);
 }
 
+static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
+	__releases(__rq_lockp(rq))
+{
+	/*
+	 * We have to zap callbacks before unlocking the rq
+	 * as another CPU may jump in and call sched_balance_rq
+	 * which can trip the warning in rq_pin_lock() if we
+	 * leave callbacks set.
+	 */
+	zap_balance_callbacks(rq);
+	rq_unpin_lock(rq, rf);
+	raw_spin_rq_unlock(rq);
+}
+
+static inline void proxy_reacquire_rq_lock(struct rq *rq, struct rq_flags *rf)
+	__acquires(__rq_lockp(rq))
+{
+	raw_spin_rq_lock(rq);
+	rq_repin_lock(rq, rf);
+	update_rq_clock(rq);
+}
+
 /*
  * If the blocked-on relationship crosses CPUs, migrate @p to the
  * owner's CPU.
@@ -6676,6 +6698,7 @@ static bool proxy_deactivate(struct rq *
  */
 static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p, int target_cpu)
+	__must_hold(__rq_lockp(rq))
 {
 	struct rq *target_rq = cpu_rq(target_cpu);
 
@@ -6699,98 +6722,72 @@ static void proxy_migrate_task(struct rq
 	deactivate_task(rq, p, DEQUEUE_NOCLOCK);
 	proxy_set_task_cpu(p, target_cpu);
 
-	/*
-	 * We have to zap callbacks before unlocking the rq
-	 * as another CPU may jump in and call sched_balance_rq
-	 * which can trip the warning in rq_pin_lock() if we
-	 * leave callbacks set.
-	 */
-	zap_balance_callbacks(rq);
-	rq_unpin_lock(rq, rf);
-	raw_spin_rq_unlock(rq);
+	proxy_release_rq_lock(rq, rf);
 
 	attach_one_task(target_rq, p);
 
-	raw_spin_rq_lock(rq);
-	rq_repin_lock(rq, rf);
-	update_rq_clock(rq);
+	proxy_reacquire_rq_lock(rq, rf);
 }
 
 static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
 			       struct task_struct *p)
+	__must_hold(__rq_lockp(rq))
 {
-	struct rq *this_rq, *target_rq;
-	struct rq_flags this_rf;
+	struct rq *task_rq, *target_rq = NULL;
 	int cpu, wake_flag = WF_TTWU;
 
 	lockdep_assert_rq_held(rq);
 	WARN_ON(p == rq->curr);
 
-	/*
-	 * We have to zap callbacks before unlocking the rq
-	 * as another CPU may jump in and call sched_balance_rq
-	 * which can trip the warning in rq_pin_lock() if we
-	 * leave callbacks set.
-	 */
-	zap_balance_callbacks(rq);
-	rq_unpin_lock(rq, rf);
-	raw_spin_rq_unlock(rq);
+	proxy_release_rq_lock(rq, rf);
 
 	/*
 	 * We drop the rq lock, and re-grab task_rq_lock to get
 	 * the pi_lock (needed for select_task_rq) as well.
 	 */
-	this_rq = task_rq_lock(p, &this_rf);
+	scoped_guard (task_rq_lock, p) {
+		task_rq = scope.rq;
 
-	/*
-	 * Since we let go of the rq lock, the task may have been
-	 * woken or migrated to another rq before we got the
-	 * task_rq_lock. So re-check we're on the same RQ. If
-	 * not, the task has already been migrated and that CPU
-	 * will handle any further migrations.
-	 */
-	if (this_rq != rq)
-		goto err_out;
-
-	/* Similarly, if we've been dequeued, someone else will wake us */
-	if (!task_on_rq_queued(p))
-		goto err_out;
-
-	/*
-	 * Since we should only be calling here from __schedule()
-	 * -> find_proxy_task(), no one else should have
-	 * assigned current out from under us. But check and warn
-	 * if we see this, then bail.
-	 */
-	if (task_current(this_rq, p) || task_on_cpu(this_rq, p)) {
-		WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
-			  __func__, cpu_of(this_rq),
-			  p->comm, p->pid, p->on_cpu);
-		goto err_out;
+		/*
+		 * Since we let go of the rq lock, the task may have been
+		 * woken or migrated to another rq before we got the
+		 * task_rq_lock. So re-check we're on the same RQ. If
+		 * not, the task has already been migrated and that CPU
+		 * will handle any further migrations.
+		 */
+		if (task_rq != rq)
+			break;
+
+		/* Similarly, if we've been dequeued, someone else will wake us */
+		if (!task_on_rq_queued(p))
+			break;
+
+		/*
+		 * Since we should only be calling here from __schedule()
+		 * -> find_proxy_task(), no one else should have
+		 * assigned current out from under us. But check and warn
+		 * if we see this, then bail.
+		 */
+		if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
+			WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d  on_cpu: %i\n",
+				  __func__, cpu_of(task_rq),
+				  p->comm, p->pid, p->on_cpu);
+			break;
+		}
+
+		update_rq_clock(task_rq);
+		proxy_resched_idle(task_rq);
+		deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
+		cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+		set_task_cpu(p, cpu);
+		target_rq = cpu_rq(cpu);
+		clear_task_blocked_on(p, NULL);
 	}
 
-	update_rq_clock(this_rq);
-	proxy_resched_idle(this_rq);
-	deactivate_task(this_rq, p, DEQUEUE_NOCLOCK);
-	cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
-	set_task_cpu(p, cpu);
-	target_rq = cpu_rq(cpu);
-	clear_task_blocked_on(p, NULL);
-	task_rq_unlock(this_rq, p, &this_rf);
+	if (target_rq)
+		attach_one_task(target_rq, p);
 
-	attach_one_task(target_rq, p);
-
-	/* Finally, re-grab the original rq lock and return to pick-again */
-	raw_spin_rq_lock(rq);
-	rq_repin_lock(rq, rf);
-	update_rq_clock(rq);
-	return;
-
-err_out:
-	task_rq_unlock(this_rq, p, &this_rf);
-	raw_spin_rq_lock(rq);
-	rq_repin_lock(rq, rf);
-	update_rq_clock(rq);
+	proxy_reacquire_rq_lock(rq, rf);
 }
 
 /*
@@ -6811,6 +6808,7 @@ static void proxy_force_return(struct rq
  */
 static struct task_struct *
 find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+	__must_hold(__rq_lockp(rq))
 {
 	enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
 	struct task_struct *owner = NULL;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration)
  2026-03-19 12:49   ` Peter Zijlstra
@ 2026-03-19 21:26     ` John Stultz
  0 siblings, 0 replies; 38+ messages in thread
From: John Stultz @ 2026-03-19 21:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
	Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
	Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
	Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
	Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team

On Thu, Mar 19, 2026 at 5:50 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Mar 13, 2026 at 02:30:10AM +0000, John Stultz wrote:
> > +/*
> > + * If the blocked-on relationship crosses CPUs, migrate @p to the
> > + * owner's CPU.
> > + *
> > + * This is because we must respect the CPU affinity of execution
> > + * contexts (owner) but we can ignore affinity for scheduling
> > + * contexts (@p). So we have to move scheduling contexts towards
> > + * potential execution contexts.
> > + *
> > + * Note: The owner can disappear, but simply migrate to @target_cpu
> > + * and leave that CPU to sort things out.
> > + */
> > +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p, int target_cpu)
> >  {
> > +     struct rq *target_rq = cpu_rq(target_cpu);
> > +
> > +     lockdep_assert_rq_held(rq);
> > +
> > +     /*
> > +      * Since we're going to drop @rq, we have to put(@rq->donor) first,
> > +      * otherwise we have a reference that no longer belongs to us.
> > +      *
> > +      * Additionally, as we put_prev_task(prev) earlier, it's possible that
> > +      * prev will migrate away as soon as we drop the rq lock, however we
> > +      * still have it marked as rq->curr, as we've not yet switched tasks.
> > +      *
> > +      * So call proxy_resched_idle() to let go of the references before
> > +      * we release the lock.
> > +      */
> > +     proxy_resched_idle(rq);
>
> This comment confuses the heck out of me. It seems to imply we need to
> schedule before dropping rq->lock.

Fair point, I wrote that a while back and indeed it's not really clear
(the rq->curr bit doesn't make much sense to me now).

There is a similar explanation in proxy_deactivate() which maybe is
a bit clearer?

Basically, since we are migrating a blocked donor, it could be
rq->donor, and we want to make sure there aren't any references from
this rq to it before we drop the lock. This avoids another CPU jumping
in, grabbing the rq lock, and referencing rq->donor or cfs_rq->curr,
etc., after we have migrated the task to another CPU.

I'll rework the comment to something like the above, but feel free
to suggest rewordings if you prefer.
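
To make that concrete, the helper boils down to roughly the following
(simplified sketch, not necessarily the exact code from the earlier
patches):

static struct task_struct *proxy_resched_idle(struct rq *rq)
{
	/*
	 * Hand the rq over to idle before the rq lock is dropped, so
	 * nothing on this rq (rq->donor, cfs_rq->curr, etc.) still
	 * references the task we are about to migrate away.
	 */
	put_prev_task(rq, rq->donor);
	rq_set_donor(rq, rq->idle);
	set_next_task(rq, rq->idle);
	set_tsk_need_resched(rq->idle);

	return rq->idle;
}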


> > +
> > +     WARN_ON(p == rq->curr);
> > +
> > +     deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> > +     proxy_set_task_cpu(p, target_cpu);
> > +
> > +     /*
> > +      * We have to zap callbacks before unlocking the rq
> > +      * as another CPU may jump in and call sched_balance_rq
> > +      * which can trip the warning in rq_pin_lock() if we
> > +      * leave callbacks set.
> > +      */
>
> It might be good to explain where these callbacks come from.

Ack. I've taken a swing at this and will include it in the next revision.
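
For anyone following along, these are the callbacks the scheduling
classes queue via queue_balance_callback() while we are picking, e.g.
the RT push work, roughly like the below (simplified illustration, not
part of this patch):

/* Simplified from the RT class, illustration only: */
static DEFINE_PER_CPU(struct balance_callback, rt_push_head);

static void rt_queue_push_tasks(struct rq *rq)
{
	if (!has_pushable_tasks(rq))
		return;

	queue_balance_callback(rq, &per_cpu(rt_push_head, rq->cpu),
			       push_rt_tasks);
}

Since find_proxy_task() may decide to pick again after migrating the
donor, a callback queued for the pick we are abandoning has to be
cleared before the rq lock is dropped; otherwise another CPU taking
the lock can trip the rq->balance_callback check in rq_pin_lock().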

>
> > +     zap_balance_callbacks(rq);
> > +     rq_unpin_lock(rq, rf);
> > +     raw_spin_rq_unlock(rq);
> > +
> > +     attach_one_task(target_rq, p);
> > +
> > +     raw_spin_rq_lock(rq);
> > +     rq_repin_lock(rq, rf);
> > +     update_rq_clock(rq);
> > +}
> > +
> > +static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> > +                            struct task_struct *p)
> > +{
> > +     struct rq *this_rq, *target_rq;
> > +     struct rq_flags this_rf;
> > +     int cpu, wake_flag = WF_TTWU;
> > +
> > +     lockdep_assert_rq_held(rq);
> > +     WARN_ON(p == rq->curr);
> > +
> > +     /*
> > +      * We have to zap callbacks before unlocking the rq
> > +      * as another CPU may jump in and call sched_balance_rq
> > +      * which can trip the warning in rq_pin_lock() if we
> > +      * leave callbacks set.
> > +      */
>
> idem
>
> > +     zap_balance_callbacks(rq);
> > +     rq_unpin_lock(rq, rf);
> > +     raw_spin_rq_unlock(rq);
>
> This is in fact the very same sequence as above.
>
...
>
> Hurm... how about something like so?

Sounds good. I've worked this in and am testing it now.

Thanks for the feedback and suggestions!
-john

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2026-03-19 21:26 UTC | newest]

Thread overview: 38+ messages
2026-03-13  2:30 [PATCH v25 0/9] Simple Donor Migration for Proxy Execution John Stultz
2026-03-13  2:30 ` [PATCH v25 1/9] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
2026-03-13 13:48   ` Juri Lelli
2026-03-13 17:53     ` John Stultz
2026-03-15 16:26   ` K Prateek Nayak
2026-03-17  4:49     ` John Stultz
2026-03-17  5:41       ` K Prateek Nayak
2026-03-17  6:04         ` John Stultz
2026-03-17  7:52           ` K Prateek Nayak
2026-03-17 18:35             ` John Stultz
2026-03-18 13:36           ` Peter Zijlstra
2026-03-18 13:52             ` Peter Zijlstra
2026-03-18 17:55               ` K Prateek Nayak
2026-03-18 20:30             ` John Stultz
2026-03-18 20:34               ` Peter Zijlstra
2026-03-18 20:35                 ` John Stultz
2026-03-18 12:55         ` Peter Zijlstra
2026-03-18 18:01           ` K Prateek Nayak
2026-03-13  2:30 ` [PATCH v25 2/9] sched: Minimise repeated sched_proxy_exec() checking John Stultz
2026-03-15 17:01   ` K Prateek Nayak
2026-03-13  2:30 ` [PATCH v25 3/9] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
2026-03-13  2:30 ` [PATCH v25 4/9] sched: Fix modifying donor->blocked on without proper locking John Stultz
2026-03-13  2:30 ` [PATCH v25 5/9] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
2026-03-13  2:30 ` [PATCH v25 6/9] sched: Add assert_balance_callbacks_empty helper John Stultz
2026-03-13  2:30 ` [PATCH v25 7/9] sched: Add logic to zap balance callbacks if we pick again John Stultz
2026-03-13  2:30 ` [PATCH v25 8/9] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
2026-03-15 16:34   ` K Prateek Nayak
2026-03-16 23:34     ` John Stultz
2026-03-17  2:29       ` K Prateek Nayak
2026-03-13  2:30 ` [PATCH v25 9/9] sched: Handle blocked-waiter migration (and return migration) John Stultz
2026-03-15 17:38   ` K Prateek Nayak
2026-03-18 19:07     ` John Stultz
2026-03-18  6:35   ` Juri Lelli
2026-03-18  6:56     ` K Prateek Nayak
2026-03-18 10:16       ` Juri Lelli
2026-03-18 12:59   ` Peter Zijlstra
2026-03-19 12:49   ` Peter Zijlstra
2026-03-19 21:26     ` John Stultz
