* [PATCH v26 01/10] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 02/10] sched: Minimise repeated sched_proxy_exec() checking John Stultz
` (9 subsequent siblings)
10 siblings, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Peter Zijlstra, Joel Fernandes,
Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
With proxy-execution, the scheduler selects the donor, but for
blocked donors, we end up running the lock owner.
This caused some complexity: the class schedulers make
sure to remove the task they pick from their pushable task
lists, which prevents the donor from being migrated, but
nothing then prevented rq->curr from being migrated
if rq->curr != rq->donor.
This was sort of hacked around by calling proxy_tag_curr() on
the rq->curr task if we were running something other than the
donor. proxy_tag_curr() did a dequeue/enqueue pair on the
rq->curr task, allowing the class schedulers to remove it from
their pushable list.
The dequeue/enqueue pair was wasteful, and additionally K Prateek
highlighted that we didn't properly undo things when we stopped
proxying, leaving the lock owner off the pushable list.
After some alternative approaches were considered, Peter
suggested having the RT/DL classes simply avoid migrating
tasks that are task_on_cpu().
So rework pick_next_pushable_dl_task() and the RT
pick_next_pushable_task() functions so that they skip over the
first pushable task if it is on_cpu.
Then just drop all of the proxy_tag_curr() logic.
Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability")
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
v26:
* Fix an issue Juri noticed by using a separate iterator value in
pick_next_pushable_dl_task()
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 24 ------------------------
kernel/sched/deadline.c | 18 ++++++++++++++++--
kernel/sched/rt.c | 15 ++++++++++++---
3 files changed, 28 insertions(+), 29 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dcaf..92b1807c05a4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6705,23 +6705,6 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
}
#endif /* SCHED_PROXY_EXEC */
-static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
-{
- if (!sched_proxy_exec())
- return;
- /*
- * pick_next_task() calls set_next_task() on the chosen task
- * at some point, which ensures it is not push/pullable.
- * However, the chosen/donor task *and* the mutex owner form an
- * atomic pair wrt push/pull.
- *
- * Make sure owner we run is not pushable. Unfortunately we can
- * only deal with that by means of a dequeue/enqueue cycle. :-/
- */
- dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
- enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
-}
-
/*
* __schedule() is the main scheduler function.
*
@@ -6874,9 +6857,6 @@ static void __sched notrace __schedule(int sched_mode)
*/
RCU_INIT_POINTER(rq->curr, next);
- if (!task_current_donor(rq, next))
- proxy_tag_curr(rq, next);
-
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
@@ -6910,10 +6890,6 @@ static void __sched notrace __schedule(int sched_mode)
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
- /* In case next was already curr but just got blocked_donor */
- if (!task_current_donor(rq, next))
- proxy_tag_curr(rq, next);
-
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq, NULL);
raw_spin_rq_unlock_irq(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b004293234..52c524f5ba4dd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2801,12 +2801,26 @@ static int find_later_rq(struct task_struct *task)
static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
{
- struct task_struct *p;
+ struct task_struct *i, *p = NULL;
+ struct rb_node *next_node;
if (!has_pushable_dl_tasks(rq))
return NULL;
- p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
+ next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
+ while (next_node) {
+ i = __node_2_pdl(next_node);
+ /* make sure task isn't on_cpu (possible with proxy-exec) */
+ if (!task_on_cpu(rq, i)) {
+ p = i;
+ break;
+ }
+
+ next_node = rb_next(next_node);
+ }
+
+ if (!p)
+ return NULL;
WARN_ON_ONCE(rq->cpu != task_cpu(p));
WARN_ON_ONCE(task_current(rq, p));
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f69e1f16d9238..61569b622d1a3 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1853,13 +1853,22 @@ static int find_lowest_rq(struct task_struct *task)
static struct task_struct *pick_next_pushable_task(struct rq *rq)
{
- struct task_struct *p;
+ struct plist_head *head = &rq->rt.pushable_tasks;
+ struct task_struct *i, *p = NULL;
if (!has_pushable_tasks(rq))
return NULL;
- p = plist_first_entry(&rq->rt.pushable_tasks,
- struct task_struct, pushable_tasks);
+ plist_for_each_entry(i, head, pushable_tasks) {
+ /* make sure task isn't on_cpu (possible with proxy-exec) */
+ if (!task_on_cpu(rq, i)) {
+ p = i;
+ break;
+ }
+ }
+
+ if (!p)
+ return NULL;
BUG_ON(rq->cpu != task_cpu(p));
BUG_ON(task_current(rq, p));
--
2.53.0.1018.g2bb0e51243-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [tip: sched/core] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
2026-03-24 19:13 ` [PATCH v26 01/10] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: K Prateek Nayak, Peter Zijlstra, John Stultz, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: e0ca8991b2de6c9dfe6fcd8a0364951b2bd56797
Gitweb: https://git.kernel.org/tip/e0ca8991b2de6c9dfe6fcd8a0364951b2bd56797
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:16
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:38 +02:00
sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr()
With proxy-execution, the scheduler selects the donor, but for
blocked donors, we end up running the lock owner.
This caused some complexity: the class schedulers make
sure to remove the task they pick from their pushable task
lists, which prevents the donor from being migrated, but
nothing then prevented rq->curr from being migrated
if rq->curr != rq->donor.
This was sort of hacked around by calling proxy_tag_curr() on
the rq->curr task if we were running something other than the
donor. proxy_tag_curr() did a dequeue/enqueue pair on the
rq->curr task, allowing the class schedulers to remove it from
their pushable list.
The dequeue/enqueue pair was wasteful, and additionally K Prateek
highlighted that we didn't properly undo things when we stopped
proxying, leaving the lock owner off the pushable list.
After some alternative approaches were considered, Peter
suggested having the RT/DL classes simply avoid migrating
tasks that are task_on_cpu().
So rework pick_next_pushable_dl_task() and the RT
pick_next_pushable_task() functions so that they skip over the
first pushable task if it is on_cpu.
Then just drop all of the proxy_tag_curr() logic.
Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability")
Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260324191337.1841376-2-jstultz@google.com
---
kernel/sched/core.c | 24 ------------------------
kernel/sched/deadline.c | 18 ++++++++++++++++--
kernel/sched/rt.c | 15 ++++++++++++---
3 files changed, 28 insertions(+), 29 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7c7d4bf..2974168 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6702,23 +6702,6 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
}
#endif /* SCHED_PROXY_EXEC */
-static inline void proxy_tag_curr(struct rq *rq, struct task_struct *owner)
-{
- if (!sched_proxy_exec())
- return;
- /*
- * pick_next_task() calls set_next_task() on the chosen task
- * at some point, which ensures it is not push/pullable.
- * However, the chosen/donor task *and* the mutex owner form an
- * atomic pair wrt push/pull.
- *
- * Make sure owner we run is not pushable. Unfortunately we can
- * only deal with that by means of a dequeue/enqueue cycle. :-/
- */
- dequeue_task(rq, owner, DEQUEUE_NOCLOCK | DEQUEUE_SAVE);
- enqueue_task(rq, owner, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE);
-}
-
/*
* __schedule() is the main scheduler function.
*
@@ -6871,9 +6854,6 @@ keep_resched:
*/
RCU_INIT_POINTER(rq->curr, next);
- if (!task_current_donor(rq, next))
- proxy_tag_curr(rq, next);
-
/*
* The membarrier system call requires each architecture
* to have a full memory barrier after updating
@@ -6907,10 +6887,6 @@ keep_resched:
/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
} else {
- /* In case next was already curr but just got blocked_donor */
- if (!task_current_donor(rq, next))
- proxy_tag_curr(rq, next);
-
rq_unpin_lock(rq, &rf);
__balance_callbacks(rq, NULL);
raw_spin_rq_unlock_irq(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 9e253a8..27359a1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2805,12 +2805,26 @@ static int find_later_rq(struct task_struct *task)
static struct task_struct *pick_next_pushable_dl_task(struct rq *rq)
{
- struct task_struct *p;
+ struct task_struct *i, *p = NULL;
+ struct rb_node *next_node;
if (!has_pushable_dl_tasks(rq))
return NULL;
- p = __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root));
+ next_node = rb_first_cached(&rq->dl.pushable_dl_tasks_root);
+ while (next_node) {
+ i = __node_2_pdl(next_node);
+ /* make sure task isn't on_cpu (possible with proxy-exec) */
+ if (!task_on_cpu(rq, i)) {
+ p = i;
+ break;
+ }
+
+ next_node = rb_next(next_node);
+ }
+
+ if (!p)
+ return NULL;
WARN_ON_ONCE(rq->cpu != task_cpu(p));
WARN_ON_ONCE(task_current(rq, p));
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 3d823f5..4e5f195 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1858,13 +1858,22 @@ static int find_lowest_rq(struct task_struct *task)
static struct task_struct *pick_next_pushable_task(struct rq *rq)
{
- struct task_struct *p;
+ struct plist_head *head = &rq->rt.pushable_tasks;
+ struct task_struct *i, *p = NULL;
if (!has_pushable_tasks(rq))
return NULL;
- p = plist_first_entry(&rq->rt.pushable_tasks,
- struct task_struct, pushable_tasks);
+ plist_for_each_entry(i, head, pushable_tasks) {
+ /* make sure task isn't on_cpu (possible with proxy-exec) */
+ if (!task_on_cpu(rq, i)) {
+ p = i;
+ break;
+ }
+ }
+
+ if (!p)
+ return NULL;
BUG_ON(rq->cpu != task_cpu(p));
BUG_ON(task_current(rq, p));
* [PATCH v26 02/10] sched: Minimise repeated sched_proxy_exec() checking
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
2026-03-24 19:13 ` [PATCH v26 01/10] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 03/10] sched: Fix potentially missing balancing with Proxy Exec John Stultz
` (8 subsequent siblings)
10 siblings, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Peter Zijlstra, Joel Fernandes,
Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
Peter noted: Compilers are really bad (as in they utterly refuse)
optimizing (even when marked with __pure) the static branch
things, and will happily emit multiple identical in a row.
So pull out the one obvious sched_proxy_exec() branch in
__schedule() and remove some of the 'implicit' ones in that
path.
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 92b1807c05a4e..dc044a405f83b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6600,11 +6600,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
struct mutex *mutex;
/* Follow blocked_on chain. */
- for (p = donor; task_is_blocked(p); p = owner) {
- mutex = p->blocked_on;
- /* Something changed in the chain, so pick again */
- if (!mutex)
- return NULL;
+ for (p = donor; (mutex = p->blocked_on); p = owner) {
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
* and ensure @owner sticks around.
@@ -6835,12 +6831,14 @@ static void __sched notrace __schedule(int sched_mode)
next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
rq->next_class = next->sched_class;
- if (unlikely(task_is_blocked(next))) {
- next = find_proxy_task(rq, next, &rf);
- if (!next)
- goto pick_again;
- if (next == rq->idle)
- goto keep_resched;
+ if (sched_proxy_exec()) {
+ if (unlikely(next->blocked_on)) {
+ next = find_proxy_task(rq, next, &rf);
+ if (!next)
+ goto pick_again;
+ if (next == rq->idle)
+ goto keep_resched;
+ }
}
picked:
clear_tsk_need_resched(prev);
--
2.53.0.1018.g2bb0e51243-goog
* [tip: sched/core] sched: Minimise repeated sched_proxy_exec() checking
2026-03-24 19:13 ` [PATCH v26 02/10] sched: Minimise repeated sched_proxy_exec() checking John Stultz
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra, John Stultz, K Prateek Nayak, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 37341ec573da7c16fdd45222b1bfb7b421dbdbcb
Gitweb: https://git.kernel.org/tip/37341ec573da7c16fdd45222b1bfb7b421dbdbcb
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:17
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:38 +02:00
sched: Minimise repeated sched_proxy_exec() checking
Peter noted: Compilers are really bad (as in they utterly refuse)
optimizing (even when marked with __pure) the static branch
things, and will happily emit multiple identical in a row.
So pull out the one obvious sched_proxy_exec() branch in
__schedule() and remove some of the 'implicit' ones in that
path.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-3-jstultz@google.com
---
kernel/sched/core.c | 20 +++++++++-----------
1 file changed, 9 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2974168..f3306d3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6597,11 +6597,7 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
struct mutex *mutex;
/* Follow blocked_on chain. */
- for (p = donor; task_is_blocked(p); p = owner) {
- mutex = p->blocked_on;
- /* Something changed in the chain, so pick again */
- if (!mutex)
- return NULL;
+ for (p = donor; (mutex = p->blocked_on); p = owner) {
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
* and ensure @owner sticks around.
@@ -6832,12 +6828,14 @@ pick_again:
next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
rq->next_class = next->sched_class;
- if (unlikely(task_is_blocked(next))) {
- next = find_proxy_task(rq, next, &rf);
- if (!next)
- goto pick_again;
- if (next == rq->idle)
- goto keep_resched;
+ if (sched_proxy_exec()) {
+ if (unlikely(next->blocked_on)) {
+ next = find_proxy_task(rq, next, &rf);
+ if (!next)
+ goto pick_again;
+ if (next == rq->idle)
+ goto keep_resched;
+ }
}
picked:
clear_tsk_need_resched(prev);
* [PATCH v26 03/10] sched: Fix potentially missing balancing with Proxy Exec
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
2026-03-24 19:13 ` [PATCH v26 01/10] sched: Make class_schedulers avoid pushing current, and get rid of proxy_tag_curr() John Stultz
2026-03-24 19:13 ` [PATCH v26 02/10] sched: Minimise repeated sched_proxy_exec() checking John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 04/10] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
` (7 subsequent siblings)
10 siblings, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Peter Zijlstra, Joel Fernandes,
Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
K Prateek pointed out that with Proxy Exec, we may have cases
where we context switch in __schedule(), while the donor remains
the same. This could cause balancing issues, since the
put_prev_set_next() logic short-cuts if (prev == next). With
proxy-exec prev is the previous donor, and next is the next
donor. Should the donor remain the same, but different tasks are
picked to actually run, the shortcut will have avoided enqueuing
the sched class balance callback.
So, if we are context switching, add logic to catch the
same-donor case, and trigger the put_prev/set_next calls to
ensure the balance callbacks get enqueued.
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Closes: https://lore.kernel.org/lkml/20ea3670-c30a-433b-a07f-c4ff98ae2379@amd.com/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
---
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dc044a405f83b..610e48cdb66a9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6829,9 +6829,11 @@ static void __sched notrace __schedule(int sched_mode)
pick_again:
next = pick_next_task(rq, rq->donor, &rf);
- rq_set_donor(rq, next);
rq->next_class = next->sched_class;
if (sched_proxy_exec()) {
+ struct task_struct *prev_donor = rq->donor;
+
+ rq_set_donor(rq, next);
if (unlikely(next->blocked_on)) {
next = find_proxy_task(rq, next, &rf);
if (!next)
@@ -6839,7 +6841,27 @@ static void __sched notrace __schedule(int sched_mode)
if (next == rq->idle)
goto keep_resched;
}
+ if (rq->donor == prev_donor && prev != next) {
+ struct task_struct *donor = rq->donor;
+ /*
+ * When transitioning like:
+ *
+ * prev next
+ * donor: B B
+ * curr: A B or C
+ *
+ * then put_prev_set_next_task() will not have done
+ * anything, since B == B. However, A might have
+ * missed a RT/DL balance opportunity due to being
+ * on_cpu.
+ */
+ donor->sched_class->put_prev_task(rq, donor, donor);
+ donor->sched_class->set_next_task(rq, donor, true);
+ }
+ } else {
+ rq_set_donor(rq, next);
}
+
picked:
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
--
2.53.0.1018.g2bb0e51243-goog
* [tip: sched/core] sched: Fix potentially missing balancing with Proxy Exec
2026-03-24 19:13 ` [PATCH v26 03/10] sched: Fix potentially missing balancing with Proxy Exec John Stultz
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: K Prateek Nayak, Peter Zijlstra, John Stultz, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: f4fe6be82e6d27349de66a42d6d1b2b11dc97a14
Gitweb: https://git.kernel.org/tip/f4fe6be82e6d27349de66a42d6d1b2b11dc97a14
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:18
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:39 +02:00
sched: Fix potentially missing balancing with Proxy Exec
K Prateek pointed out that with Proxy Exec, we may have cases
where we context switch in __schedule(), while the donor remains
the same. This could cause balancing issues, since the
put_prev_set_next() logic short-cuts if (prev == next). With
proxy-exec prev is the previous donor, and next is the next
donor. Should the donor remain the same, but different tasks are
picked to actually run, the shortcut will have avoided enqueuing
the sched class balance callback.
So, if we are context switching, add logic to catch the
same-donor case, and trigger the put_prev/set_next calls to
ensure the balance callbacks get enqueued.
Closes: https://lore.kernel.org/lkml/20ea3670-c30a-433b-a07f-c4ff98ae2379@amd.com/
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260324191337.1841376-4-jstultz@google.com
---
kernel/sched/core.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3306d3..5b7f378 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6826,9 +6826,11 @@ static void __sched notrace __schedule(int sched_mode)
pick_again:
next = pick_next_task(rq, rq->donor, &rf);
- rq_set_donor(rq, next);
rq->next_class = next->sched_class;
if (sched_proxy_exec()) {
+ struct task_struct *prev_donor = rq->donor;
+
+ rq_set_donor(rq, next);
if (unlikely(next->blocked_on)) {
next = find_proxy_task(rq, next, &rf);
if (!next)
@@ -6836,7 +6838,27 @@ pick_again:
if (next == rq->idle)
goto keep_resched;
}
+ if (rq->donor == prev_donor && prev != next) {
+ struct task_struct *donor = rq->donor;
+ /*
+ * When transitioning like:
+ *
+ * prev next
+ * donor: B B
+ * curr: A B or C
+ *
+ * then put_prev_set_next_task() will not have done
+ * anything, since B == B. However, A might have
+ * missed a RT/DL balance opportunity due to being
+ * on_cpu.
+ */
+ donor->sched_class->put_prev_task(rq, donor, donor);
+ donor->sched_class->set_next_task(rq, donor, true);
+ }
+ } else {
+ rq_set_donor(rq, next);
}
+
picked:
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
* [PATCH v26 04/10] locking: Add task::blocked_lock to serialize blocked_on state
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
` (2 preceding siblings ...)
2026-03-24 19:13 ` [PATCH v26 03/10] sched: Fix potentially missing balancing with Proxy Exec John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 05/10] sched: Fix modifying donor->blocked on without proper locking John Stultz
` (6 subsequent siblings)
10 siblings, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.
So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.
Signed-off-by: John Stultz <jstultz@google.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v15:
* Split back out into later in the series
v16:
* Fixups to mark tasks unblocked before sleeping in
mutex_optimistic_spin()
* Rework to use guard() as suggested by Peter
v19:
* Rework logic for PREEMPT_RT issues reported by
K Prateek Nayak
v21:
* After recently thinking more on ww_mutex code, I
reworked the blocked_lock usage in mutex lock to
avoid having to take nested locks in the ww_mutex
paths, as I was concerned the lock ordering
constraints weren't as strong as I had previously
thought.
v22:
* Added some extra spaces to avoid dense code blocks
suggested by K Prateek
v23:
* Move get_task_blocked_on() to kernel/locking/mutex.h
as requested by PeterZ
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
include/linux/sched.h | 48 +++++++++++++-----------------------
init/init_task.c | 1 +
kernel/fork.c | 1 +
kernel/locking/mutex-debug.c | 4 +--
kernel/locking/mutex.c | 40 +++++++++++++++++++-----------
kernel/locking/mutex.h | 6 +++++
kernel/locking/ww_mutex.h | 4 +--
kernel/sched/core.c | 4 ++-
8 files changed, 58 insertions(+), 50 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a5d3dbc9cdf3..2eef9bc6daaab 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1238,6 +1238,7 @@ struct task_struct {
#endif
struct mutex *blocked_on; /* lock we're blocked on */
+ raw_spinlock_t blocked_lock;
#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
/*
@@ -2181,57 +2182,42 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
#ifndef CONFIG_PREEMPT_RT
static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
{
- struct mutex *m = p->blocked_on;
-
- if (m)
- lockdep_assert_held_once(&m->wait_lock);
- return m;
+ lockdep_assert_held_once(&p->blocked_lock);
+ return p->blocked_on;
}
static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
{
- struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
WARN_ON_ONCE(!m);
/* The task should only be setting itself as blocked */
WARN_ON_ONCE(p != current);
- /* Currently we serialize blocked_on under the mutex::wait_lock */
- lockdep_assert_held_once(&m->wait_lock);
+ /* Currently we serialize blocked_on under the task::blocked_lock */
+ lockdep_assert_held_once(&p->blocked_lock);
/*
* Check ensure we don't overwrite existing mutex value
* with a different mutex. Note, setting it to the same
* lock repeatedly is ok.
*/
- WARN_ON_ONCE(blocked_on && blocked_on != m);
- WRITE_ONCE(p->blocked_on, m);
-}
-
-static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
-{
- guard(raw_spinlock_irqsave)(&m->wait_lock);
- __set_task_blocked_on(p, m);
+ WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+ p->blocked_on = m;
}
static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
{
- if (m) {
- struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
- /* Currently we serialize blocked_on under the mutex::wait_lock */
- lockdep_assert_held_once(&m->wait_lock);
- /*
- * There may be cases where we re-clear already cleared
- * blocked_on relationships, but make sure we are not
- * clearing the relationship with a different lock.
- */
- WARN_ON_ONCE(blocked_on && blocked_on != m);
- }
- WRITE_ONCE(p->blocked_on, NULL);
+ /* Currently we serialize blocked_on under the task::blocked_lock */
+ lockdep_assert_held_once(&p->blocked_lock);
+ /*
+ * There may be cases where we re-clear already cleared
+ * blocked_on relationships, but make sure we are not
+ * clearing the relationship with a different lock.
+ */
+ WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+ p->blocked_on = NULL;
}
static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
{
- guard(raw_spinlock_irqsave)(&m->wait_lock);
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
__clear_task_blocked_on(p, m);
}
#else
diff --git a/init/init_task.c b/init/init_task.c
index 5c838757fc10e..b5f48ebdc2b6e 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -169,6 +169,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.journal_info = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+ .blocked_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
.timer_slack_ns = 50000, /* 50 usec default slack */
.thread_pid = &init_struct_pid,
.thread_node = LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index bc2bf58b93b65..079802cb61002 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2076,6 +2076,7 @@ __latent_entropy struct task_struct *copy_process(
ftrace_graph_init_task(p);
rt_mutex_init_task(p);
+ raw_spin_lock_init(&p->blocked_lock);
lockdep_assert_irqs_enabled();
#ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 2c6b02d4699be..cc6aa9c6e9813 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
lockdep_assert_held(&lock->wait_lock);
/* Current thread can't be already blocked (since it's executing!) */
- DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
+ DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
}
void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
{
- struct mutex *blocked_on = __get_task_blocked_on(task);
+ struct mutex *blocked_on = get_task_blocked_on(task);
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2a1d165b3167e..4aa79bcab08c7 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -656,6 +656,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err_early_kill;
}
+ raw_spin_lock(&current->blocked_lock);
__set_task_blocked_on(current, lock);
set_current_state(state);
trace_contention_begin(lock, LCB_F_MUTEX);
@@ -669,8 +670,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* the handoff.
*/
if (__mutex_trylock(lock))
- goto acquired;
+ break;
+ raw_spin_unlock(&current->blocked_lock);
/*
* Check for signals and kill conditions while holding
* wait_lock. This ensures the lock cancellation is ordered
@@ -693,12 +695,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
first = __mutex_waiter_is_first(lock, &waiter);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
/*
* As we likely have been woken up by task
* that has cleared our blocked_on state, re-set
* it to the lock we are trying to acquire.
*/
- set_task_blocked_on(current, lock);
+ __set_task_blocked_on(current, lock);
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -709,25 +713,33 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
break;
if (first) {
- trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+ bool opt_acquired;
+
/*
* mutex_optimistic_spin() can call schedule(), so
- * clear blocked on so we don't become unselectable
+ * we need to release these locks before calling it,
+ * and clear blocked on so we don't become unselectable
* to run.
*/
- clear_task_blocked_on(current, lock);
- if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+ __clear_task_blocked_on(current, lock);
+ raw_spin_unlock(&current->blocked_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+
+ trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+ opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
+ __set_task_blocked_on(current, lock);
+
+ if (opt_acquired)
break;
- set_task_blocked_on(current, lock);
trace_contention_begin(lock, LCB_F_MUTEX);
}
-
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
__clear_task_blocked_on(current, lock);
__set_current_state(TASK_RUNNING);
+ raw_spin_unlock(&current->blocked_lock);
if (ww_ctx) {
/*
@@ -756,11 +768,11 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
return 0;
err:
- __clear_task_blocked_on(current, lock);
+ clear_task_blocked_on(current, lock);
__set_current_state(TASK_RUNNING);
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
- WARN_ON(__get_task_blocked_on(current));
+ WARN_ON(get_task_blocked_on(current));
trace_contention_end(lock, ret);
raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
debug_mutex_free_waiter(&waiter);
@@ -971,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
debug_mutex_wake_waiter(lock, waiter);
- __clear_task_blocked_on(next, lock);
+ clear_task_blocked_on(next, lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 9ad4da8cea004..7a8ba13fee949 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -47,6 +47,12 @@ static inline struct task_struct *__mutex_owner(struct mutex *lock)
return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
}
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
+ return __get_task_blocked_on(p);
+}
+
#ifdef CONFIG_DEBUG_MUTEXES
extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 31a785afee6c0..e4a81790ea7dd 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
* blocked_on pointer. Otherwise we can see circular
* blocked_on relationships that can't resolve.
*/
- __clear_task_blocked_on(waiter->task, lock);
+ clear_task_blocked_on(waiter->task, lock);
wake_q_add(wake_q, waiter->task);
}
@@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* are waking the mutex owner, who may be currently
* blocked on a different mutex.
*/
- __clear_task_blocked_on(owner, NULL);
+ clear_task_blocked_on(owner, NULL);
wake_q_add(wake_q, owner);
}
return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 610e48cdb66a9..7187c63174cd2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6587,6 +6587,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
* p->pi_lock
* rq->lock
* mutex->wait_lock
+ * p->blocked_lock
*
* Returns the task that is going to be used as execution context (the one
* that is actually going to be run on cpu_of(rq)).
@@ -6606,8 +6607,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
* and ensure @owner sticks around.
*/
guard(raw_spinlock)(&mutex->wait_lock);
+ guard(raw_spinlock)(&p->blocked_lock);
- /* Check again that p is blocked with wait_lock held */
+ /* Check again that p is blocked with blocked_lock held */
if (mutex != __get_task_blocked_on(p)) {
/*
* Something changed in the blocked_on chain and
--
2.53.0.1018.g2bb0e51243-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [tip: sched/core] locking: Add task::blocked_lock to serialize blocked_on state
2026-03-24 19:13 ` [PATCH v26 04/10] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: John Stultz, Peter Zijlstra (Intel), K Prateek Nayak, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: fa4a1ff8ab235a308d8c983827657a69649185fd
Gitweb: https://git.kernel.org/tip/fa4a1ff8ab235a308d8c983827657a69649185fd
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:19
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:39 +02:00
locking: Add task::blocked_lock to serialize blocked_on state
So far, we have been able to utilize the mutex::wait_lock
for serializing the blocked_on state, but when we move to
proxying across runqueues, we will need to add more state
and a way to serialize changes to this state in contexts
where we don't hold the mutex::wait_lock.
So introduce the task::blocked_lock, which nests under the
mutex::wait_lock in the locking order, and rework the locking
to use it.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-5-jstultz@google.com
---
include/linux/sched.h | 48 ++++++++++++-----------------------
init/init_task.c | 1 +-
kernel/fork.c | 1 +-
kernel/locking/mutex-debug.c | 4 +--
kernel/locking/mutex.c | 40 ++++++++++++++++++-----------
kernel/locking/mutex.h | 6 ++++-
kernel/locking/ww_mutex.h | 4 +--
kernel/sched/core.c | 4 ++-
8 files changed, 58 insertions(+), 50 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a5d3db..2eef9bc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1238,6 +1238,7 @@ struct task_struct {
#endif
struct mutex *blocked_on; /* lock we're blocked on */
+ raw_spinlock_t blocked_lock;
#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKER
/*
@@ -2181,57 +2182,42 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
#ifndef CONFIG_PREEMPT_RT
static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
{
- struct mutex *m = p->blocked_on;
-
- if (m)
- lockdep_assert_held_once(&m->wait_lock);
- return m;
+ lockdep_assert_held_once(&p->blocked_lock);
+ return p->blocked_on;
}
static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
{
- struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
WARN_ON_ONCE(!m);
/* The task should only be setting itself as blocked */
WARN_ON_ONCE(p != current);
- /* Currently we serialize blocked_on under the mutex::wait_lock */
- lockdep_assert_held_once(&m->wait_lock);
+ /* Currently we serialize blocked_on under the task::blocked_lock */
+ lockdep_assert_held_once(&p->blocked_lock);
/*
* Check ensure we don't overwrite existing mutex value
* with a different mutex. Note, setting it to the same
* lock repeatedly is ok.
*/
- WARN_ON_ONCE(blocked_on && blocked_on != m);
- WRITE_ONCE(p->blocked_on, m);
-}
-
-static inline void set_task_blocked_on(struct task_struct *p, struct mutex *m)
-{
- guard(raw_spinlock_irqsave)(&m->wait_lock);
- __set_task_blocked_on(p, m);
+ WARN_ON_ONCE(p->blocked_on && p->blocked_on != m);
+ p->blocked_on = m;
}
static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *m)
{
- if (m) {
- struct mutex *blocked_on = READ_ONCE(p->blocked_on);
-
- /* Currently we serialize blocked_on under the mutex::wait_lock */
- lockdep_assert_held_once(&m->wait_lock);
- /*
- * There may be cases where we re-clear already cleared
- * blocked_on relationships, but make sure we are not
- * clearing the relationship with a different lock.
- */
- WARN_ON_ONCE(blocked_on && blocked_on != m);
- }
- WRITE_ONCE(p->blocked_on, NULL);
+ /* Currently we serialize blocked_on under the task::blocked_lock */
+ lockdep_assert_held_once(&p->blocked_lock);
+ /*
+ * There may be cases where we re-clear already cleared
+ * blocked_on relationships, but make sure we are not
+ * clearing the relationship with a different lock.
+ */
+ WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+ p->blocked_on = NULL;
}
static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
{
- guard(raw_spinlock_irqsave)(&m->wait_lock);
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
__clear_task_blocked_on(p, m);
}
#else
diff --git a/init/init_task.c b/init/init_task.c
index 5c83875..b5f48eb 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -169,6 +169,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
.journal_info = NULL,
INIT_CPU_TIMERS(init_task)
.pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+ .blocked_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
.timer_slack_ns = 50000, /* 50 usec default slack */
.thread_pid = &init_struct_pid,
.thread_node = LIST_HEAD_INIT(init_signals.thread_head),
diff --git a/kernel/fork.c b/kernel/fork.c
index bc2bf58..079802c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2076,6 +2076,7 @@ __latent_entropy struct task_struct *copy_process(
ftrace_graph_init_task(p);
rt_mutex_init_task(p);
+ raw_spin_lock_init(&p->blocked_lock);
lockdep_assert_irqs_enabled();
#ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index 2c6b02d..cc6aa9c 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -54,13 +54,13 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
lockdep_assert_held(&lock->wait_lock);
/* Current thread can't be already blocked (since it's executing!) */
- DEBUG_LOCKS_WARN_ON(__get_task_blocked_on(task));
+ DEBUG_LOCKS_WARN_ON(get_task_blocked_on(task));
}
void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct task_struct *task)
{
- struct mutex *blocked_on = __get_task_blocked_on(task);
+ struct mutex *blocked_on = get_task_blocked_on(task);
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
DEBUG_LOCKS_WARN_ON(waiter->task != task);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 2a1d165..4aa79bc 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -656,6 +656,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
goto err_early_kill;
}
+ raw_spin_lock(&current->blocked_lock);
__set_task_blocked_on(current, lock);
set_current_state(state);
trace_contention_begin(lock, LCB_F_MUTEX);
@@ -669,8 +670,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
* the handoff.
*/
if (__mutex_trylock(lock))
- goto acquired;
+ break;
+ raw_spin_unlock(&current->blocked_lock);
/*
* Check for signals and kill conditions while holding
* wait_lock. This ensures the lock cancellation is ordered
@@ -693,12 +695,14 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
first = __mutex_waiter_is_first(lock, &waiter);
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
/*
* As we likely have been woken up by task
* that has cleared our blocked_on state, re-set
* it to the lock we are trying to acquire.
*/
- set_task_blocked_on(current, lock);
+ __set_task_blocked_on(current, lock);
set_current_state(state);
/*
* Here we order against unlock; we must either see it change
@@ -709,25 +713,33 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclas
break;
if (first) {
- trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+ bool opt_acquired;
+
/*
* mutex_optimistic_spin() can call schedule(), so
- * clear blocked on so we don't become unselectable
+ * we need to release these locks before calling it,
+ * and clear blocked on so we don't become unselectable
* to run.
*/
- clear_task_blocked_on(current, lock);
- if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+ __clear_task_blocked_on(current, lock);
+ raw_spin_unlock(&current->blocked_lock);
+ raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
+
+ trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
+ opt_acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+
+ raw_spin_lock_irqsave(&lock->wait_lock, flags);
+ raw_spin_lock(&current->blocked_lock);
+ __set_task_blocked_on(current, lock);
+
+ if (opt_acquired)
break;
- set_task_blocked_on(current, lock);
trace_contention_begin(lock, LCB_F_MUTEX);
}
-
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
}
- raw_spin_lock_irqsave(&lock->wait_lock, flags);
-acquired:
__clear_task_blocked_on(current, lock);
__set_current_state(TASK_RUNNING);
+ raw_spin_unlock(&current->blocked_lock);
if (ww_ctx) {
/*
@@ -756,11 +768,11 @@ skip_wait:
return 0;
err:
- __clear_task_blocked_on(current, lock);
+ clear_task_blocked_on(current, lock);
__set_current_state(TASK_RUNNING);
__mutex_remove_waiter(lock, &waiter);
err_early_kill:
- WARN_ON(__get_task_blocked_on(current));
+ WARN_ON(get_task_blocked_on(current));
trace_contention_end(lock, ret);
raw_spin_unlock_irqrestore_wake(&lock->wait_lock, flags, &wake_q);
debug_mutex_free_waiter(&waiter);
@@ -971,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
debug_mutex_wake_waiter(lock, waiter);
- __clear_task_blocked_on(next, lock);
+ clear_task_blocked_on(next, lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 9ad4da8..7a8ba13 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -47,6 +47,12 @@ static inline struct task_struct *__mutex_owner(struct mutex *lock)
return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
}
+static inline struct mutex *get_task_blocked_on(struct task_struct *p)
+{
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
+ return __get_task_blocked_on(p);
+}
+
#ifdef CONFIG_DEBUG_MUTEXES
extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter);
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 31a785a..e4a8179 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -289,7 +289,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
* blocked_on pointer. Otherwise we can see circular
* blocked_on relationships that can't resolve.
*/
- __clear_task_blocked_on(waiter->task, lock);
+ clear_task_blocked_on(waiter->task, lock);
wake_q_add(wake_q, waiter->task);
}
@@ -347,7 +347,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* are waking the mutex owner, who may be currently
* blocked on a different mutex.
*/
- __clear_task_blocked_on(owner, NULL);
+ clear_task_blocked_on(owner, NULL);
wake_q_add(wake_q, owner);
}
return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b7f378..1913dbc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6584,6 +6584,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
* p->pi_lock
* rq->lock
* mutex->wait_lock
+ * p->blocked_lock
*
* Returns the task that is going to be used as execution context (the one
* that is actually going to be run on cpu_of(rq)).
@@ -6603,8 +6604,9 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
* and ensure @owner sticks around.
*/
guard(raw_spinlock)(&mutex->wait_lock);
+ guard(raw_spinlock)(&p->blocked_lock);
- /* Check again that p is blocked with wait_lock held */
+ /* Check again that p is blocked with blocked_lock held */
if (mutex != __get_task_blocked_on(p)) {
/*
* Something changed in the blocked_on chain and
* [PATCH v26 05/10] sched: Fix modifying donor->blocked on without proper locking
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
` (3 preceding siblings ...)
2026-03-24 19:13 ` [PATCH v26 04/10] locking: Add task::blocked_lock to serialize blocked_on state John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-03-26 21:45 ` Steven Rostedt
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 06/10] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
` (5 subsequent siblings)
10 siblings, 2 replies; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
Introduce an action enum in find_proxy_task() which allows
us to handle work that needs to be done outside the
mutex.wait_lock and task.blocked_lock guard scopes.
This ensures proper locking when we clear the donor's blocked_on
pointer in proxy_deactivate(), and the switch statement will be
useful as we add more cases to handle later in this series.
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v23:
* Split out from earlier patch.
v24:
* Minor re-ordering local variables to keep with style
as suggested by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7187c63174cd2..c43e7926fda51 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6571,7 +6571,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
* as unblocked, as we aren't doing proxy-migrations
* yet (more logic will be needed then).
*/
- donor->blocked_on = NULL;
+ clear_task_blocked_on(donor, NULL);
}
return NULL;
}
@@ -6595,6 +6595,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
{
+ enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
struct task_struct *owner = NULL;
int this_cpu = cpu_of(rq);
struct task_struct *p;
@@ -6628,12 +6629,14 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
- return proxy_deactivate(rq, donor);
+ action = DEACTIVATE_DONOR;
+ break;
}
if (task_cpu(owner) != this_cpu) {
/* XXX Don't handle migrations yet */
- return proxy_deactivate(rq, donor);
+ action = DEACTIVATE_DONOR;
+ break;
}
if (task_on_rq_migrating(owner)) {
@@ -6691,6 +6694,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
*/
}
+ /* Handle actions we need to do outside of the guard() scope */
+ switch (action) {
+ case DEACTIVATE_DONOR:
+ return proxy_deactivate(rq, donor);
+ case FOUND:
+ /* fallthrough */;
+ }
WARN_ON_ONCE(owner && !owner->on_rq);
return owner;
}
--
2.53.0.1018.g2bb0e51243-goog
* Re: [PATCH v26 05/10] sched: Fix modifying donor->blocked on without proper locking
2026-03-24 19:13 ` [PATCH v26 05/10] sched: Fix modifying donor->blocked on without proper locking John Stultz
@ 2026-03-26 21:45 ` Steven Rostedt
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
1 sibling, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2026-03-26 21:45 UTC (permalink / raw)
To: John Stultz
Cc: LKML, K Prateek Nayak, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Nit, Subject needs s/donor->blocked on/donor->blocked_on/
On Tue, 24 Mar 2026 19:13:20 +0000
John Stultz <jstultz@google.com> wrote:
> @@ -6595,6 +6595,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
> static struct task_struct *
> find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> {
> + enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
> struct task_struct *owner = NULL;
> int this_cpu = cpu_of(rq);
> struct task_struct *p;
> @@ -6628,12 +6629,14 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>
> if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
> /* XXX Don't handle blocked owners/delayed dequeue yet */
> - return proxy_deactivate(rq, donor);
> + action = DEACTIVATE_DONOR;
> + break;
> }
>
> if (task_cpu(owner) != this_cpu) {
> /* XXX Don't handle migrations yet */
> - return proxy_deactivate(rq, donor);
> + action = DEACTIVATE_DONOR;
> + break;
> }
>
> if (task_on_rq_migrating(owner)) {
> @@ -6691,6 +6694,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> */
> }
>
> + /* Handle actions we need to do outside of the guard() scope */
> + switch (action) {
> + case DEACTIVATE_DONOR:
> + return proxy_deactivate(rq, donor);
> + case FOUND:
> + /* fallthrough */;
A fall through comment for exiting the switch statement is rather
confusing. Fallthrough usually means to fall into the next case statement.
You could just do:
switch (action) {
case DEACTIVATE_DONOR:
return proxy_deactivate(rq, donor);
case FOUND:
break;
}
Which is what I believe is the normal method of adding enums to switch
statements that don't do anything.
-- Steve
> + }
> WARN_ON_ONCE(owner && !owner->on_rq);
> return owner;
> }
* [tip: sched/core] sched: Fix modifying donor->blocked on without proper locking
2026-03-24 19:13 ` [PATCH v26 05/10] sched: Fix modifying donor->blocked on without proper locking John Stultz
2026-03-26 21:45 ` Steven Rostedt
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
1 sibling, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: John Stultz, Peter Zijlstra (Intel), K Prateek Nayak, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 56f4b24267a643b0b9ab73f09feaaabfee5a37ae
Gitweb: https://git.kernel.org/tip/56f4b24267a643b0b9ab73f09feaaabfee5a37ae
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:20
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:39 +02:00
sched: Fix modifying donor->blocked on without proper locking
Introduce an action enum in find_proxy_task() which allows
us to handle work that needs to be done outside the
mutex.wait_lock and task.blocked_lock guard scopes.
This ensures proper locking when we clear the donor's blocked_on
pointer in proxy_deactivate(), and the switch statement will be
useful as we add more cases to handle later in this series.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-6-jstultz@google.com
---
kernel/sched/core.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1913dbc..bf4338f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6568,7 +6568,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
* as unblocked, as we aren't doing proxy-migrations
* yet (more logic will be needed then).
*/
- donor->blocked_on = NULL;
+ clear_task_blocked_on(donor, NULL);
}
return NULL;
}
@@ -6592,6 +6592,7 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
{
+ enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
struct task_struct *owner = NULL;
int this_cpu = cpu_of(rq);
struct task_struct *p;
@@ -6625,12 +6626,14 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
- return proxy_deactivate(rq, donor);
+ action = DEACTIVATE_DONOR;
+ break;
}
if (task_cpu(owner) != this_cpu) {
/* XXX Don't handle migrations yet */
- return proxy_deactivate(rq, donor);
+ action = DEACTIVATE_DONOR;
+ break;
}
if (task_on_rq_migrating(owner)) {
@@ -6688,6 +6691,13 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
*/
}
+ /* Handle actions we need to do outside of the guard() scope */
+ switch (action) {
+ case DEACTIVATE_DONOR:
+ return proxy_deactivate(rq, donor);
+ case FOUND:
+ /* fallthrough */;
+ }
WARN_ON_ONCE(owner && !owner->on_rq);
return owner;
}
* [PATCH v26 06/10] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
` (4 preceding siblings ...)
2026-03-24 19:13 ` [PATCH v26 05/10] sched: Fix modifying donor->blocked on without proper locking John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 07/10] sched: Add assert_balance_callbacks_empty helper John Stultz
` (4 subsequent siblings)
10 siblings, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.
Peter helpfully provided the following example with pictures:
"Suppose we have a ww_mutex cycle:
,-+-* Mutex-1 <-.
Task-A ---' | | ,-- Task-B
`-> Mutex-2 *-+-'
Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.
Then the blocked_on->owner chain will go in circles.
Task-A -> Mutex-2
^ |
| v
Mutex-1 <- Task-B
We need two things:
- find_proxy_task() to stop iterating the circle;
- the woken task to 'unblock' and run, such that it can
back-off and re-try the transaction.
Now, the current code [without this patch] does:
__clear_task_blocked_on();
wake_q_add();
And surely clearing ->blocked_on is sufficient to break the
cycle.
Suppose it is Task-B that is made to back-off, then we have:
Task-A -> Mutex-2 -> Task-B (no further blocked_on)
and it would attempt to run Task-B. Or worse, it could directly
pick Task-B and run it, without ever getting into
find_proxy_task().
Now, here is a problem because Task-B might not be runnable on
the CPU it is currently on; and because !task_is_blocked() we
don't get into the proxy paths, so nobody is going to fix this
up.
Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, [the lock ordering prevents us from
getting the task_rq_lock() and] spoils things."
Thus we need more than just a binary concept of the task being
blocked on a mutex or not.
So allow setting blocked_on to PROXY_WAKING as a special value
which specifies the task is no longer blocked, but needs to
be evaluated for return migration *before* it can be run.
This will then be used in a later patch to handle proxy
return-migration.
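The idea of a sentinel pointer as a third blocked_on state can be sketched in isolation. Below is a hypothetical user-space reduction (the struct, helpers, and field names are simplified stand-ins, not the kernel's): blocked_on is NULL (not blocked), a real mutex pointer (blocked), or a never-dereferenced sentinel meaning "no longer blocked, but must be evaluated for return-migration before running".

```c
#include <stddef.h>

/*
 * Simplified sketch of the tri-state: the sentinel is an invalid
 * pointer value that is compared against but never dereferenced.
 */
struct mutex;					/* opaque here */
#define PROXY_WAKING ((struct mutex *)(-1L))

struct task {
	struct mutex *blocked_on;
};

static int task_is_blocked(const struct task *p)
{
	/* PROXY_WAKING reads as "not blocked on any particular mutex" */
	return p->blocked_on && p->blocked_on != PROXY_WAKING;
}

static int task_needs_return_migration(const struct task *p)
{
	return p->blocked_on == PROXY_WAKING;
}
```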
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v15:
* Split blocked_on_state into its own patch later in the
series, as the tri-state isn't necessary until we deal
with proxy/return migrations
v16:
* Handle case where task in the chain is being set as
BO_WAKING by another cpu (usually via ww_mutex die code).
Make sure we release the rq lock so the wakeup can
complete.
* Rework to use guard() in find_proxy_task() as suggested
by Peter
v18:
* Add initialization of blocked_on_state for init_task
v19:
* PREEMPT_RT build fixups and rework suggested by
K Prateek Nayak
v20:
* Simplify one of the blocked_on_state changes to avoid extra
v21:
* Slight reworks due to avoiding nested blocked_lock locking
* Be consistent in use of blocked_on_state helper functions
* Rework calls to proxy_deactivate() to do proper locking
around blocked_on_state changes that we were cheating in
previous versions.
* Minor cleanups, some comment improvements
v22:
* Re-order blocked_on_state helpers to try to make it clearer
the set_task_blocked_on() and clear_task_blocked_on() are
the main enter/exit states and the blocked_on_state helpers
help manage the transition states within. Per feedback from
K Prateek Nayak.
* Rework blocked_on_state to be defined within
CONFIG_SCHED_PROXY_EXEC as suggested by K Prateek Nayak.
* Reworked empty stub functions to just take one line as
suggested by K Prateek
* Avoid using gotos out of a guard() scope, as highlighted by
K Prateek, and instead rework logic to break and switch()
on an action value.
v23:
* Big rework to using PROXY_WAKING instead of blocked_on_state
as suggested by Peter.
* Reworked commit message to include Peter's nice diagrams and
example for why this extra state is necessary.
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
include/linux/sched.h | 51 +++++++++++++++++++++++++++++++++++++--
kernel/locking/mutex.c | 2 +-
kernel/locking/ww_mutex.h | 16 ++++++------
kernel/sched/core.c | 16 ++++++++++++
4 files changed, 74 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2eef9bc6daaab..8ec3b6d7d718b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2180,10 +2180,20 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
})
#ifndef CONFIG_PREEMPT_RT
+
+/*
+ * With proxy exec, if a task has been proxy-migrated, it may be a donor
+ * on a cpu that it can't actually run on. Thus we need a special state
+ * to denote that the task is being woken, but that it needs to be
+ * evaluated for return-migration before it is run. So if the task is
+ * blocked_on PROXY_WAKING, return migrate it before running it.
+ */
+#define PROXY_WAKING ((struct mutex *)(-1L))
+
static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
{
lockdep_assert_held_once(&p->blocked_lock);
- return p->blocked_on;
+ return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
}
static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2211,7 +2221,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
* blocked_on relationships, but make sure we are not
* clearing the relationship with a different lock.
*/
- WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+ WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
p->blocked_on = NULL;
}
@@ -2220,6 +2230,35 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
guard(raw_spinlock_irqsave)(&p->blocked_lock);
__clear_task_blocked_on(p, m);
}
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+ /* Currently we serialize blocked_on under the task::blocked_lock */
+ lockdep_assert_held_once(&p->blocked_lock);
+
+ if (!sched_proxy_exec()) {
+ __clear_task_blocked_on(p, m);
+ return;
+ }
+
+ /* Don't set PROXY_WAKING if blocked_on was already cleared */
+ if (!p->blocked_on)
+ return;
+ /*
+ * There may be cases where we set PROXY_WAKING on tasks that were
+ * already set to waking, but make sure we are not changing
+ * the relationship with a different lock.
+ */
+ WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+ p->blocked_on = PROXY_WAKING;
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
+ __set_task_blocked_on_waking(p, m);
+}
+
#else
static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
{
@@ -2228,6 +2267,14 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mute
static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
{
}
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
#endif /* !CONFIG_PREEMPT_RT */
static __always_inline bool need_resched(void)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4aa79bcab08c7..7d359647156df 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -983,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
debug_mutex_wake_waiter(lock, waiter);
- clear_task_blocked_on(next, lock);
+ set_task_blocked_on_waking(next, lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e4a81790ea7dd..5cd9dfa4b31e6 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
debug_mutex_wake_waiter(lock, waiter);
#endif
/*
- * When waking up the task to die, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
- * blocked_on relationships that can't resolve.
+ * When waking up the task to die, be sure to set the
+ * blocked_on to PROXY_WAKING. Otherwise we can see
+ * circular blocked_on relationships that can't resolve.
*/
- clear_task_blocked_on(waiter->task, lock);
+ set_task_blocked_on_waking(waiter->task, lock);
wake_q_add(wake_q, waiter->task);
}
@@ -339,15 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
*/
if (owner != current) {
/*
- * When waking up the task to wound, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
- * blocked_on relationships that can't resolve.
+ * When waking up the task to wound, be sure to set the
+ * blocked_on to PROXY_WAKING. Otherwise we can see
+ * circular blocked_on relationships that can't resolve.
*
* NOTE: We pass NULL here instead of lock, because we
* are waking the mutex owner, who may be currently
* blocked on a different mutex.
*/
- clear_task_blocked_on(owner, NULL);
+ set_task_blocked_on_waking(owner, NULL);
wake_q_add(wake_q, owner);
}
return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c43e7926fda51..aa2e7287235e3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4242,6 +4242,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu, wake_flags);
}
out:
+ /*
+ * For now, if we've been woken up, clear the task->blocked_on
+ * regardless of whether it was set to a mutex or PROXY_WAKING so the
+ * task can run. We will need to be more careful later when
+ * properly handling proxy migration
+ */
+ clear_task_blocked_on(p, NULL);
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
@@ -6603,6 +6610,10 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
/* Follow blocked_on chain. */
for (p = donor; (mutex = p->blocked_on); p = owner) {
+ /* if it's PROXY_WAKING, resched_idle so ttwu can complete */
+ if (mutex == PROXY_WAKING)
+ return proxy_resched_idle(rq);
+
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
* and ensure @owner sticks around.
@@ -6623,6 +6634,11 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
owner = __mutex_owner(mutex);
if (!owner) {
+ /*
+ * If there is no owner, clear blocked_on
+ * and return p so it can run and try to
+ * acquire the lock
+ */
__clear_task_blocked_on(p, mutex);
return p;
}
--
2.53.0.1018.g2bb0e51243-goog
* [tip: sched/core] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
2026-03-24 19:13 ` [PATCH v26 06/10] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: John Stultz, Peter Zijlstra (Intel), K Prateek Nayak, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 2d7622669836dcbbb449741b4e6c503ffe005c25
Gitweb: https://git.kernel.org/tip/2d7622669836dcbbb449741b4e6c503ffe005c25
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:21
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:40 +02:00
sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration
As we add functionality to proxy execution, we may migrate a
donor task to a runqueue where it can't run due to cpu affinity.
Thus, we must be careful to ensure we return-migrate the task
back to a cpu in its cpumask when it becomes unblocked.
Peter helpfully provided the following example with pictures:
"Suppose we have a ww_mutex cycle:
,-+-* Mutex-1 <-.
Task-A ---' | | ,-- Task-B
`-> Mutex-2 *-+-'
Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and
where Task-B holds Mutex-2 and tries to acquire Mutex-1.
Then the blocked_on->owner chain will go in circles.
Task-A -> Mutex-2
^ |
| v
Mutex-1 <- Task-B
We need two things:
- find_proxy_task() to stop iterating the circle;
- the woken task to 'unblock' and run, such that it can
back-off and re-try the transaction.
Now, the current code [without this patch] does:
__clear_task_blocked_on();
wake_q_add();
And surely clearing ->blocked_on is sufficient to break the
cycle.
Suppose it is Task-B that is made to back-off, then we have:
Task-A -> Mutex-2 -> Task-B (no further blocked_on)
and it would attempt to run Task-B. Or worse, it could directly
pick Task-B and run it, without ever getting into
find_proxy_task().
Now, here is a problem because Task-B might not be runnable on
the CPU it is currently on; and because !task_is_blocked() we
don't get into the proxy paths, so nobody is going to fix this
up.
Ideally we would have dequeued Task-B alongside of clearing
->blocked_on, but alas, [the lock ordering prevents us from
getting the task_rq_lock() and] spoils things."
Thus we need more than just a binary concept of the task being
blocked on a mutex or not.
So allow setting blocked_on to PROXY_WAKING as a special value
which specifies the task is no longer blocked, but needs to
be evaluated for return migration *before* it can be run.
This will then be used in a later patch to handle proxy
return-migration.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-7-jstultz@google.com
---
include/linux/sched.h | 51 ++++++++++++++++++++++++++++++++++++--
kernel/locking/mutex.c | 2 +-
kernel/locking/ww_mutex.h | 16 ++++++------
kernel/sched/core.c | 16 ++++++++++++-
4 files changed, 74 insertions(+), 11 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2eef9bc..8ec3b6d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2180,10 +2180,20 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock) __must_hold(lock);
})
#ifndef CONFIG_PREEMPT_RT
+
+/*
+ * With proxy exec, if a task has been proxy-migrated, it may be a donor
+ * on a cpu that it can't actually run on. Thus we need a special state
+ * to denote that the task is being woken, but that it needs to be
+ * evaluated for return-migration before it is run. So if the task is
+ * blocked_on PROXY_WAKING, return migrate it before running it.
+ */
+#define PROXY_WAKING ((struct mutex *)(-1L))
+
static inline struct mutex *__get_task_blocked_on(struct task_struct *p)
{
lockdep_assert_held_once(&p->blocked_lock);
- return p->blocked_on;
+ return p->blocked_on == PROXY_WAKING ? NULL : p->blocked_on;
}
static inline void __set_task_blocked_on(struct task_struct *p, struct mutex *m)
@@ -2211,7 +2221,7 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct mutex *
* blocked_on relationships, but make sure we are not
* clearing the relationship with a different lock.
*/
- WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m);
+ WARN_ON_ONCE(m && p->blocked_on && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
p->blocked_on = NULL;
}
@@ -2220,6 +2230,35 @@ static inline void clear_task_blocked_on(struct task_struct *p, struct mutex *m)
guard(raw_spinlock_irqsave)(&p->blocked_lock);
__clear_task_blocked_on(p, m);
}
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+ /* Currently we serialize blocked_on under the task::blocked_lock */
+ lockdep_assert_held_once(&p->blocked_lock);
+
+ if (!sched_proxy_exec()) {
+ __clear_task_blocked_on(p, m);
+ return;
+ }
+
+ /* Don't set PROXY_WAKING if blocked_on was already cleared */
+ if (!p->blocked_on)
+ return;
+ /*
+ * There may be cases where we set PROXY_WAKING on tasks that were
+ * already set to waking, but make sure we are not changing
+ * the relationship with a different lock.
+ */
+ WARN_ON_ONCE(m && p->blocked_on != m && p->blocked_on != PROXY_WAKING);
+ p->blocked_on = PROXY_WAKING;
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct mutex *m)
+{
+ guard(raw_spinlock_irqsave)(&p->blocked_lock);
+ __set_task_blocked_on_waking(p, m);
+}
+
#else
static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
{
@@ -2228,6 +2267,14 @@ static inline void __clear_task_blocked_on(struct task_struct *p, struct rt_mute
static inline void clear_task_blocked_on(struct task_struct *p, struct rt_mutex *m)
{
}
+
+static inline void __set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
+
+static inline void set_task_blocked_on_waking(struct task_struct *p, struct rt_mutex *m)
+{
+}
#endif /* !CONFIG_PREEMPT_RT */
static __always_inline bool need_resched(void)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4aa79bc..7d35964 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -983,7 +983,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
next = waiter->task;
debug_mutex_wake_waiter(lock, waiter);
- clear_task_blocked_on(next, lock);
+ set_task_blocked_on_waking(next, lock);
wake_q_add(&wake_q, next);
}
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e4a8179..5cd9dfa 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -285,11 +285,11 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
debug_mutex_wake_waiter(lock, waiter);
#endif
/*
- * When waking up the task to die, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
- * blocked_on relationships that can't resolve.
+ * When waking up the task to die, be sure to set the
+ * blocked_on to PROXY_WAKING. Otherwise we can see
+ * circular blocked_on relationships that can't resolve.
*/
- clear_task_blocked_on(waiter->task, lock);
+ set_task_blocked_on_waking(waiter->task, lock);
wake_q_add(wake_q, waiter->task);
}
@@ -339,15 +339,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
*/
if (owner != current) {
/*
- * When waking up the task to wound, be sure to clear the
- * blocked_on pointer. Otherwise we can see circular
- * blocked_on relationships that can't resolve.
+ * When waking up the task to wound, be sure to set the
+ * blocked_on to PROXY_WAKING. Otherwise we can see
+ * circular blocked_on relationships that can't resolve.
*
* NOTE: We pass NULL here instead of lock, because we
* are waking the mutex owner, who may be currently
* blocked on a different mutex.
*/
- clear_task_blocked_on(owner, NULL);
+ set_task_blocked_on_waking(owner, NULL);
wake_q_add(wake_q, owner);
}
return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bf4338f..c997d51 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4239,6 +4239,13 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu, wake_flags);
}
out:
+ /*
+ * For now, if we've been woken up, clear the task->blocked_on
+ * regardless of whether it was set to a mutex or PROXY_WAKING so the
+ * task can run. We will need to be more careful later when
+ * properly handling proxy migration
+ */
+ clear_task_blocked_on(p, NULL);
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
@@ -6600,6 +6607,10 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
/* Follow blocked_on chain. */
for (p = donor; (mutex = p->blocked_on); p = owner) {
+ /* if it's PROXY_WAKING, resched_idle so ttwu can complete */
+ if (mutex == PROXY_WAKING)
+ return proxy_resched_idle(rq);
+
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
* and ensure @owner sticks around.
@@ -6620,6 +6631,11 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
owner = __mutex_owner(mutex);
if (!owner) {
+ /*
+ * If there is no owner, clear blocked_on
+ * and return p so it can run and try to
+ * acquire the lock
+ */
__clear_task_blocked_on(p, mutex);
return p;
}
* [PATCH v26 07/10] sched: Add assert_balance_callbacks_empty helper
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
` (5 preceding siblings ...)
2026-03-24 19:13 ` [PATCH v26 06/10] sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy return-migration John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 08/10] sched: Add logic to zap balance callbacks if we pick again John Stultz
` (3 subsequent siblings)
10 siblings, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Peter Zijlstra, K Prateek Nayak, Joel Fernandes,
Qais Yousef, Ingo Molnar, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
With proxy-exec utilizing pick-again logic, we can end up having
balance callbacks set by the previous pick_next_task() call left
on the list.
So pull the warning out into a helper function, and make sure we
check it when we pick again.
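The IS_ENABLED() trick the helper uses is worth spelling out: because it expands to a compile-time constant, placing it first in the condition lets the compiler drop the whole check when the option is off, while the rest of the expression still gets type-checked. Here is a minimal user-space sketch with stand-ins for IS_ENABLED(), WARN_ON_ONCE(), and the config option (none of these definitions are the kernel's).

```c
#include <stddef.h>

#define IS_ENABLED(opt) (opt)		/* simplified stand-in */
#define CONFIG_DEBUG_CHECKS 1		/* pretend the option is on */

static int warn_count;			/* counts WARN_ON_ONCE hits */
#define WARN_ON_ONCE(cond) do { if (cond) warn_count++; } while (0)

/*
 * Warn if a callback is queued that isn't the special push
 * sentinel; fully compiled out when CONFIG_DEBUG_CHECKS is 0.
 */
static void assert_callbacks_empty(void *cb, void *push_sentinel)
{
	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_CHECKS) &&
		     cb && cb != push_sentinel);
}
```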
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v24:
* Use IS_ENABLED() as suggested by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 9 ++++++++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aa2e7287235e3..b316b6015ffea 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6856,6 +6856,7 @@ static void __sched notrace __schedule(int sched_mode)
}
pick_again:
+ assert_balance_callbacks_empty(rq);
next = pick_next_task(rq, rq->donor, &rf);
rq->next_class = next->sched_class;
if (sched_proxy_exec()) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 43bbf0693cca4..2a0236d745832 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1853,6 +1853,13 @@ static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
static inline void scx_rq_clock_invalidate(struct rq *rq) {}
#endif /* !CONFIG_SCHED_CLASS_EXT */
+static inline void assert_balance_callbacks_empty(struct rq *rq)
+{
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_LOCKING) &&
+ rq->balance_callback &&
+ rq->balance_callback != &balance_push_callback);
+}
+
/*
* Lockdep annotation that avoids accidental unlocks; it's like a
* sticky/continuous lockdep_assert_held().
@@ -1869,7 +1876,7 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
rf->clock_update_flags = 0;
- WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback);
+ assert_balance_callbacks_empty(rq);
}
static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
--
2.53.0.1018.g2bb0e51243-goog
* [tip: sched/core] sched: Add assert_balance_callbacks_empty helper
2026-03-24 19:13 ` [PATCH v26 07/10] sched: Add assert_balance_callbacks_empty helper John Stultz
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: Peter Zijlstra, John Stultz, K Prateek Nayak, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: f9530b3183358bbf945f7c20d4a6e2048061ec50
Gitweb: https://git.kernel.org/tip/f9530b3183358bbf945f7c20d4a6e2048061ec50
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:22
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:40 +02:00
sched: Add assert_balance_callbacks_empty helper
With proxy-exec utilizing pick-again logic, we can end up having
balance callbacks set by the previous pick_next_task() call left
on the list.
So pull the warning out into a helper function, and make sure we
check it when we pick again.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-8-jstultz@google.com
---
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 9 ++++++++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c997d51..acb5894 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6853,6 +6853,7 @@ static void __sched notrace __schedule(int sched_mode)
}
pick_again:
+ assert_balance_callbacks_empty(rq);
next = pick_next_task(rq, rq->donor, &rf);
rq->next_class = next->sched_class;
if (sched_proxy_exec()) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b863bbd..a2629d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1857,6 +1857,13 @@ static inline void scx_rq_clock_update(struct rq *rq, u64 clock) {}
static inline void scx_rq_clock_invalidate(struct rq *rq) {}
#endif /* !CONFIG_SCHED_CLASS_EXT */
+static inline void assert_balance_callbacks_empty(struct rq *rq)
+{
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_LOCKING) &&
+ rq->balance_callback &&
+ rq->balance_callback != &balance_push_callback);
+}
+
/*
* Lockdep annotation that avoids accidental unlocks; it's like a
* sticky/continuous lockdep_assert_held().
@@ -1873,7 +1880,7 @@ static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
rf->clock_update_flags = 0;
- WARN_ON_ONCE(rq->balance_callback && rq->balance_callback != &balance_push_callback);
+ assert_balance_callbacks_empty(rq);
}
static inline void rq_unpin_lock(struct rq *rq, struct rq_flags *rf)
* [PATCH v26 08/10] sched: Add logic to zap balance callbacks if we pick again
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
` (6 preceding siblings ...)
2026-03-24 19:13 ` [PATCH v26 07/10] sched: Add assert_balance_callbacks_empty helper John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 09/10] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
` (2 subsequent siblings)
10 siblings, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
With proxy-exec, a task is selected to run via pick_next_task(),
and then if it is a mutex blocked task, we call find_proxy_task()
to find a runnable owner. If the runnable owner is on another
cpu, we will need to migrate the selected donor task away, after
which we will pick_again can call pick_next_task() to choose
something else.
However, in the first call to pick_next_task(), we may have
had a balance_callback setup by the class scheduler. After we
pick again, its possible pick_next_task_fair() will be called
which calls sched_balance_newidle() and sched_balance_rq().
This will throw a warning:
[ 8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[ 8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[ 8.796467] Call Trace:
[ 8.796467] <TASK>
[ 8.796467] ? __warn.cold+0xb2/0x14e
[ 8.796467] ? sched_balance_rq+0xe92/0x1250
[ 8.796467] ? report_bug+0x107/0x1a0
[ 8.796467] ? handle_bug+0x54/0x90
[ 8.796467] ? exc_invalid_op+0x17/0x70
[ 8.796467] ? asm_exc_invalid_op+0x1a/0x20
[ 8.796467] ? sched_balance_rq+0xe92/0x1250
[ 8.796467] sched_balance_newidle+0x295/0x820
[ 8.796467] pick_next_task_fair+0x51/0x3f0
[ 8.796467] __schedule+0x23a/0x14b0
[ 8.796467] ? lock_release+0x16d/0x2e0
[ 8.796467] schedule+0x3d/0x150
[ 8.796467] worker_thread+0xb5/0x350
[ 8.796467] ? __pfx_worker_thread+0x10/0x10
[ 8.796467] kthread+0xee/0x120
[ 8.796467] ? __pfx_kthread+0x10/0x10
[ 8.796467] ret_from_fork+0x31/0x50
[ 8.796467] ? __pfx_kthread+0x10/0x10
[ 8.796467] ret_from_fork_asm+0x1a/0x30
[ 8.796467] </TASK>
This is because if a RT task was originally picked, it will
setup the rq->balance_callback with push_rt_tasks() via
set_next_task_rt().
Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not
in the same state as it was the first time pick_next_task was
called.
To handle this, add a zap_balance_callbacks() helper function
which cleans up the balance callbacks without running them. This
should be ok, as we are effectively undoing the state set in
the first call to pick_next_task(), and when we pick again,
the new callback can be configured for the donor task actually
selected.
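The zap logic can be reduced to a small user-space sketch (simplified from the patch; the struct and sentinel here are stand-ins, and no callback function pointers are modeled): walk the singly linked callback list, unlink each node without invoking it, and keep only the special balance_push sentinel if it was present.

```c
#include <stddef.h>

struct balance_callback {
	struct balance_callback *next;
};

static struct balance_callback balance_push_callback;	/* sentinel */

/*
 * Unlink every queued callback without running it; preserve the
 * balance_push sentinel as the new head if it was on the list.
 */
static struct balance_callback *zap_callbacks(struct balance_callback *head)
{
	struct balance_callback *next;
	int found = 0;

	while (head) {
		if (head == &balance_push_callback)
			found = 1;
		next = head->next;
		head->next = NULL;	/* unlink without invoking */
		head = next;
	}
	return found ? &balance_push_callback : NULL;
}
```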
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v20:
* Tweaked to avoid build issues with different configs
v22:
* Spelling fix suggested by K Prateek
* Collapsed the stub implementation to one line as suggested
by K Prateek
* Zap callbacks when we resched idle, as suggested by K Prateek
v24:
* Don't conditionalize function on CONFIG_SCHED_PROXY_EXEC as
the callers will be optimized out if that is unset, and the
dead function will be removed, as suggested by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b316b6015ffea..4ed24ef590f73 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4920,6 +4920,34 @@ static inline void finish_task(struct task_struct *prev)
smp_store_release(&prev->on_cpu, 0);
}
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+ struct balance_callback *next, *head;
+ bool found = false;
+
+ lockdep_assert_rq_held(rq);
+
+ head = rq->balance_callback;
+ while (head) {
+ if (head == &balance_push_callback)
+ found = true;
+ next = head->next;
+ head->next = NULL;
+ head = next;
+ }
+ rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+
static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
{
void (*func)(struct rq *rq);
@@ -6865,10 +6893,14 @@ static void __sched notrace __schedule(int sched_mode)
rq_set_donor(rq, next);
if (unlikely(next->blocked_on)) {
next = find_proxy_task(rq, next, &rf);
- if (!next)
+ if (!next) {
+ zap_balance_callbacks(rq);
goto pick_again;
- if (next == rq->idle)
+ }
+ if (next == rq->idle) {
+ zap_balance_callbacks(rq);
goto keep_resched;
+ }
}
if (rq->donor == prev_donor && prev != next) {
struct task_struct *donor = rq->donor;
--
2.53.0.1018.g2bb0e51243-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [tip: sched/core] sched: Add logic to zap balance callbacks if we pick again
2026-03-24 19:13 ` [PATCH v26 08/10] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: John Stultz, Peter Zijlstra (Intel), K Prateek Nayak, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 48fda62de67a1e88fc8bada12caf0fc9b45116df
Gitweb: https://git.kernel.org/tip/48fda62de67a1e88fc8bada12caf0fc9b45116df
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:23
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:40 +02:00
sched: Add logic to zap balance callbacks if we pick again
With proxy-exec, a task is selected to run via pick_next_task(),
and then if it is a mutex blocked task, we call find_proxy_task()
to find a runnable owner. If the runnable owner is on another
cpu, we will need to migrate the selected donor task away, after
which we pick_again and call pick_next_task() to choose
something else.
However, in the first call to pick_next_task(), we may have
had a balance_callback set up by the class scheduler. After we
pick again, it's possible pick_next_task_fair() will be called
which calls sched_balance_newidle() and sched_balance_rq().
This will throw a warning:
[ 8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback
[ 8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250
...
[ 8.796467] Call Trace:
[ 8.796467] <TASK>
[ 8.796467] ? __warn.cold+0xb2/0x14e
[ 8.796467] ? sched_balance_rq+0xe92/0x1250
[ 8.796467] ? report_bug+0x107/0x1a0
[ 8.796467] ? handle_bug+0x54/0x90
[ 8.796467] ? exc_invalid_op+0x17/0x70
[ 8.796467] ? asm_exc_invalid_op+0x1a/0x20
[ 8.796467] ? sched_balance_rq+0xe92/0x1250
[ 8.796467] sched_balance_newidle+0x295/0x820
[ 8.796467] pick_next_task_fair+0x51/0x3f0
[ 8.796467] __schedule+0x23a/0x14b0
[ 8.796467] ? lock_release+0x16d/0x2e0
[ 8.796467] schedule+0x3d/0x150
[ 8.796467] worker_thread+0xb5/0x350
[ 8.796467] ? __pfx_worker_thread+0x10/0x10
[ 8.796467] kthread+0xee/0x120
[ 8.796467] ? __pfx_kthread+0x10/0x10
[ 8.796467] ret_from_fork+0x31/0x50
[ 8.796467] ? __pfx_kthread+0x10/0x10
[ 8.796467] ret_from_fork_asm+0x1a/0x30
[ 8.796467] </TASK>
This is because if an RT task was originally picked, it will
set up the rq->balance_callback with push_rt_tasks() via
set_next_task_rt().
Once the task is migrated away and we pick again, we haven't
processed any balance callbacks, so rq->balance_callback is not
in the same state as it was the first time pick_next_task was
called.
To handle this, add a zap_balance_callbacks() helper function
which cleans up the balance callbacks without running them. This
should be ok, as we are effectively undoing the state set in
the first call to pick_next_task(), and when we pick again,
the new callback can be configured for the donor task actually
selected.
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-9-jstultz@google.com
---
kernel/sched/core.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index acb5894..162b24c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4917,6 +4917,34 @@ static inline void finish_task(struct task_struct *prev)
smp_store_release(&prev->on_cpu, 0);
}
+/*
+ * Only called from __schedule context
+ *
+ * There are some cases where we are going to re-do the action
+ * that added the balance callbacks. We may not be in a state
+ * where we can run them, so just zap them so they can be
+ * properly re-added on the next time around. This is similar
+ * handling to running the callbacks, except we just don't call
+ * them.
+ */
+static void zap_balance_callbacks(struct rq *rq)
+{
+ struct balance_callback *next, *head;
+ bool found = false;
+
+ lockdep_assert_rq_held(rq);
+
+ head = rq->balance_callback;
+ while (head) {
+ if (head == &balance_push_callback)
+ found = true;
+ next = head->next;
+ head->next = NULL;
+ head = next;
+ }
+ rq->balance_callback = found ? &balance_push_callback : NULL;
+}
+
static void do_balance_callbacks(struct rq *rq, struct balance_callback *head)
{
void (*func)(struct rq *rq);
@@ -6862,10 +6890,14 @@ pick_again:
rq_set_donor(rq, next);
if (unlikely(next->blocked_on)) {
next = find_proxy_task(rq, next, &rf);
- if (!next)
+ if (!next) {
+ zap_balance_callbacks(rq);
goto pick_again;
- if (next == rq->idle)
+ }
+ if (next == rq->idle) {
+ zap_balance_callbacks(rq);
goto keep_resched;
+ }
}
if (rq->donor == prev_donor && prev != next) {
struct task_struct *donor = rq->donor;
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH v26 09/10] sched: Move attach_one_task and attach_task helpers to sched.h
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
` (7 preceding siblings ...)
2026-03-24 19:13 ` [PATCH v26 08/10] sched: Add logic to zap balance callbacks if we pick again John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2026-03-24 19:13 ` [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration) John Stultz
2026-03-25 10:52 ` [PATCH v26 00/10] Simple Donor Migration for Proxy Execution K Prateek Nayak
10 siblings, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, K Prateek Nayak, Joel Fernandes, Qais Yousef,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Dietmar Eggemann, Valentin Schneider, Steven Rostedt, Ben Segall,
Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
The fair scheduler locally introduced attach_one_task() and
attach_task() helpers, but these could be generically useful so
move this code to sched.h so we can use them elsewhere.
One minor tweak was made to utilize guard(rq_lock)(rq) to
simplify the function.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v26:
* Folded in switch to use guard(rq_lock)(rq) as suggested
by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/fair.c | 26 --------------------------
kernel/sched/sched.h | 23 +++++++++++++++++++++++
2 files changed, 23 insertions(+), 26 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bf948db905ed1..53da01a251487 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9784,32 +9784,6 @@ static int detach_tasks(struct lb_env *env)
return detached;
}
-/*
- * attach_task() -- attach the task detached by detach_task() to its new rq.
- */
-static void attach_task(struct rq *rq, struct task_struct *p)
-{
- lockdep_assert_rq_held(rq);
-
- WARN_ON_ONCE(task_rq(p) != rq);
- activate_task(rq, p, ENQUEUE_NOCLOCK);
- wakeup_preempt(rq, p, 0);
-}
-
-/*
- * attach_one_task() -- attaches the task returned from detach_one_task() to
- * its new rq.
- */
-static void attach_one_task(struct rq *rq, struct task_struct *p)
-{
- struct rq_flags rf;
-
- rq_lock(rq, &rf);
- update_rq_clock(rq);
- attach_task(rq, p);
- rq_unlock(rq, &rf);
-}
-
/*
* attach_tasks() -- attaches all tasks detached by detach_tasks() to their
* new rq.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a0236d745832..d4def70df05a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3008,6 +3008,29 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
extern void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags);
+/*
+ * attach_task() -- attach the task detached by detach_task() to its new rq.
+ */
+static inline void attach_task(struct rq *rq, struct task_struct *p)
+{
+ lockdep_assert_rq_held(rq);
+
+ WARN_ON_ONCE(task_rq(p) != rq);
+ activate_task(rq, p, ENQUEUE_NOCLOCK);
+ wakeup_preempt(rq, p, 0);
+}
+
+/*
+ * attach_one_task() -- attaches the task returned from detach_one_task() to
+ * its new rq.
+ */
+static inline void attach_one_task(struct rq *rq, struct task_struct *p)
+{
+ guard(rq_lock)(rq);
+ update_rq_clock(rq);
+ attach_task(rq, p);
+}
+
#ifdef CONFIG_PREEMPT_RT
# define SCHED_NR_MIGRATE_BREAK 8
#else
--
2.53.0.1018.g2bb0e51243-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [tip: sched/core] sched: Move attach_one_task and attach_task helpers to sched.h
2026-03-24 19:13 ` [PATCH v26 09/10] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
0 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits
Cc: K Prateek Nayak, John Stultz, Peter Zijlstra (Intel), x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: dec9554dc036183c715d02e9cfe48986d453427a
Gitweb: https://git.kernel.org/tip/dec9554dc036183c715d02e9cfe48986d453427a
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:24
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:40 +02:00
sched: Move attach_one_task and attach_task helpers to sched.h
The fair scheduler locally introduced attach_one_task() and
attach_task() helpers, but these could be generically useful so
move this code to sched.h so we can use them elsewhere.
One minor tweak was made to utilize guard(rq_lock)(rq) to
simplify the function.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260324191337.1841376-10-jstultz@google.com
---
kernel/sched/fair.c | 26 --------------------------
kernel/sched/sched.h | 23 +++++++++++++++++++++++
2 files changed, 23 insertions(+), 26 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7f35dd4..41293d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9947,32 +9947,6 @@ next:
}
/*
- * attach_task() -- attach the task detached by detach_task() to its new rq.
- */
-static void attach_task(struct rq *rq, struct task_struct *p)
-{
- lockdep_assert_rq_held(rq);
-
- WARN_ON_ONCE(task_rq(p) != rq);
- activate_task(rq, p, ENQUEUE_NOCLOCK);
- wakeup_preempt(rq, p, 0);
-}
-
-/*
- * attach_one_task() -- attaches the task returned from detach_one_task() to
- * its new rq.
- */
-static void attach_one_task(struct rq *rq, struct task_struct *p)
-{
- struct rq_flags rf;
-
- rq_lock(rq, &rf);
- update_rq_clock(rq);
- attach_task(rq, p);
- rq_unlock(rq, &rf);
-}
-
-/*
* attach_tasks() -- attaches all tasks detached by detach_tasks() to their
* new rq.
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a2629d0..9594355 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3012,6 +3012,29 @@ extern void deactivate_task(struct rq *rq, struct task_struct *p, int flags);
extern void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags);
+/*
+ * attach_task() -- attach the task detached by detach_task() to its new rq.
+ */
+static inline void attach_task(struct rq *rq, struct task_struct *p)
+{
+ lockdep_assert_rq_held(rq);
+
+ WARN_ON_ONCE(task_rq(p) != rq);
+ activate_task(rq, p, ENQUEUE_NOCLOCK);
+ wakeup_preempt(rq, p, 0);
+}
+
+/*
+ * attach_one_task() -- attaches the task returned from detach_one_task() to
+ * its new rq.
+ */
+static inline void attach_one_task(struct rq *rq, struct task_struct *p)
+{
+ guard(rq_lock)(rq);
+ update_rq_clock(rq);
+ attach_task(rq, p);
+}
+
#ifdef CONFIG_PREEMPT_RT
# define SCHED_NR_MIGRATE_BREAK 8
#else
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration)
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
` (8 preceding siblings ...)
2026-03-24 19:13 ` [PATCH v26 09/10] sched: Move attach_one_task and attach_task helpers to sched.h John Stultz
@ 2026-03-24 19:13 ` John Stultz
2026-03-26 22:52 ` Steven Rostedt
` (2 more replies)
2026-03-25 10:52 ` [PATCH v26 00/10] Simple Donor Migration for Proxy Execution K Prateek Nayak
10 siblings, 3 replies; 56+ messages in thread
From: John Stultz @ 2026-03-24 19:13 UTC (permalink / raw)
To: LKML
Cc: John Stultz, Joel Fernandes, Qais Yousef, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Steven Rostedt, Ben Segall, Zimuzo Ezeozue,
Mel Gorman, Will Deacon, Waiman Long, Boqun Feng,
Paul E. McKenney, Metin Kaya, Xuewen Yan, K Prateek Nayak,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
hupu, kernel-team
Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.
Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.
Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.
This was split out from the larger proxy patch, and
significantly reworked.
Credits for the original patch go to:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
---
v6:
* Integrated sched_proxy_exec() check in proxy_return_migration()
* Minor cleanups to diff
* Unpin the rq before calling __balance_callbacks()
* Tweak proxy migrate to migrate deeper task in chain, to avoid
tasks pingponging between rqs
v7:
* Fixup for unused function arguments
* Switch from that_rq -> target_rq, other minor tweaks, and typo
fixes suggested by Metin Kaya
* Switch back to doing return migration in the ttwu path, which
avoids nasty lock juggling and performance issues
* Fixes for UP builds
v8:
* More simplifications from Metin Kaya
* Fixes for null owner case, including doing return migration
* Cleanup proxy_needs_return logic
v9:
* Narrow logic in ttwu that sets BO_RUNNABLE, to avoid missed
return migrations
* Switch to using zap_balance_callbacks rather than running
them when we are dropping rq locks for proxy_migration.
* Drop task_is_blocked check in sched_submit_work as suggested
by Metin (may re-add later if this causes trouble)
* Do return migration when we're not on wake_cpu. This avoids
bad task placement caused by proxy migrations raised by
Xuewen Yan
* Fix to call set_next_task(rq->curr) prior to dropping rq lock
to avoid rq->curr getting migrated before we have actually
switched from it
* Cleanup to re-use proxy_resched_idle() instead of open coding
it in proxy_migrate_task()
* Fix return migration not to use DEQUEUE_SLEEP, so that we
properly see the task as task_on_rq_migrating() after it is
dequeued but before set_task_cpu() has been called on it
* Fix to broaden find_proxy_task() checks to avoid race where
a task is dequeued off the rq due to return migration, but
set_task_cpu() and the enqueue on another rq happened after
we checked task_cpu(owner). This ensures we don't proxy
using a task that is not actually on our runqueue.
* Cleanup to avoid the locked BO_WAKING->BO_RUNNABLE transition
in try_to_wake_up() if proxy execution isn't enabled.
* Cleanup to improve comment in proxy_migrate_task() explaining
the set_next_task(rq->curr) logic
* Cleanup deadline.c change to stylistically match rt.c change
* Numerous cleanups suggested by Metin
v10:
* Drop WARN_ON(task_is_blocked(p)) in ttwu current case
v11:
* Include proxy_set_task_cpu from later in the series to this
change so we can use it, rather than reworking logic later
in the series.
* Fix problem with return migration, where affinity was changed
and wake_cpu was left outside the affinity mask.
* Avoid reading the owner's cpu twice (as it might change in between)
to avoid occasional migration-to-same-cpu edge cases
* Add extra WARN_ON checks for wake_cpu and return migration
edge cases.
* Typo fix from Metin
v13:
* As we set ret, return it, not just NULL (pulling this change
in from later patch)
* Avoid deadlock between try_to_wake_up() and find_proxy_task() when
blocked_on cycle with ww_mutex is trying a mid-chain wakeup.
* Tweaks to use new __set_blocked_on_runnable() helper
* Potential fix for incorrectly updated task->dl_server issues
* Minor comment improvements
* Add logic to handle missed wakeups, in that case doing return
migration from the find_proxy_task() path
* Minor cleanups
v14:
* Improve edge cases where we wouldn't set the task as BO_RUNNABLE
v15:
* Added comment to better describe proxy_needs_return() as suggested
by Qais
* Build fixes for !CONFIG_SMP reported by
Maciej Żenczykowski <maze@google.com>
* Adds fix for re-evaluating proxy_needs_return when
sched_proxy_exec() is disabled, reported and diagnosed by:
kuyo chang <kuyo.chang@mediatek.com>
v16:
* Larger rework of needs_return logic in find_proxy_task, in
order to avoid problems with cpuhotplug
* Rework to use guard() as suggested by Peter
v18:
* Integrate optimization suggested by Suleiman to do the checks
for sleeping owners before checking if the task_cpu is this_cpu,
so that we can avoid needlessly proxy-migrating tasks to only
then dequeue them. Also check if migrating last.
* Improve comments around guard locking
* Include tweak to ttwu_runnable() as suggested by
hupu <hupu.gm@gmail.com>
* Rework the logic releasing the rq->donor reference before letting
go of the rqlock. Just use rq->idle.
* Go back to doing return migration on BO_WAKING owners, as I was
hitting some softlockups caused by running tasks not making
it out of BO_WAKING.
v19:
* Fixed proxy_force_return() logic for !SMP cases
v21:
* Reworked donor deactivation for unhandled sleeping owners
* Commit message tweaks
v22:
* Add comments around zap_balance_callbacks in proxy_migration logic
* Rework logic to avoid gotos out of guard() scopes, and instead
use break and switch() on action value, as suggested by K Prateek
* K Prateek suggested simplifications around putting donor and
setting idle as next task in the migration paths, which I further
simplified to using proxy_resched_idle()
* Comment improvements
* Dropped curr != donor check in pick_next_task_fair() suggested by
K Prateek
v23:
* Rework to use the PROXY_WAKING approach suggested by Peter
* Drop unnecessarily setting wake_cpu when affinity changes
as noticed by Peter
* Split out the ttwu() logic changes into a later separate patch
as suggested by Peter
v24:
* Numerous fixes for rq clock handling, pointed out by K Prateek
* Slight tweak to where put_task() is called suggested by
K Prateek
v25:
* Use WF_TTWU in proxy_force_return(), suggested by K Prateek
* Drop get/put_task_struct() in proxy_force_return(), suggested
by K Prateek
* Use attach_one_task() to reduce repetitive logic, as suggested
by K Prateek
v26:
* Add context analysis fixups suggested by Peter
* Add proxy_release/reacquire_rq_lock helpers suggested by Peter
* Rework comments as suggested by Peter
* Rework logic to use scoped_guard (task_rq_lock, p) suggested
by Peter
* Move proxy_resched_idle() call up earlier before rq release
in proxy_force_return() as suggested by K Prateek
* If needed, mark task PROXY_WAKING if try_to_block_task() fails
due to a signal, as noted by K Prateek
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
kernel/sched/core.c | 225 ++++++++++++++++++++++++++++++++++++++------
1 file changed, 197 insertions(+), 28 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ed24ef590f73..49e4528450083 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
rq->idle_stamp = 0;
}
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+ unsigned int wake_cpu;
+
+ /*
+ * Since we are enqueuing a blocked task on a cpu it may
+ * not be able to run on, preserve wake_cpu when we
+ * __set_task_cpu so we can return the task to where it
+ * was previously runnable.
+ */
+ wake_cpu = p->wake_cpu;
+ __set_task_cpu(p, cpu);
+ p->wake_cpu = wake_cpu;
+}
+#endif /* CONFIG_SCHED_PROXY_EXEC */
+
static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
@@ -4242,13 +4259,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu, wake_flags);
}
out:
- /*
- * For now, if we've been woken up, clear the task->blocked_on
- * regardless if it was set to a mutex or PROXY_WAKING so the
- * task can run. We will need to be more careful later when
- * properly handling proxy migration
- */
- clear_task_blocked_on(p, NULL);
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
@@ -6533,6 +6543,8 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
if (signal_pending_state(task_state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
*task_state_p = TASK_RUNNING;
+ set_task_blocked_on_waking(p, NULL);
+
return false;
}
@@ -6578,7 +6590,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
return rq->idle;
}
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
{
unsigned long state = READ_ONCE(donor->__state);
@@ -6598,17 +6610,140 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
return try_to_block_task(rq, donor, &state, true);
}
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
+ __releases(__rq_lockp(rq))
+{
+ /*
+ * The class scheduler may have queued a balance callback
+ * from pick_next_task() called earlier.
+ *
+ * So here we have to zap callbacks before unlocking the rq
+ * as another CPU may jump in and call sched_balance_rq
+ * which can trip the warning in rq_pin_lock() if we
+ * leave callbacks set.
+ *
+ * After we later reacquire the rq lock, we will force __schedule()
+ * to pick_again, so the callbacks will get re-established.
+ */
+ zap_balance_callbacks(rq);
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+}
+
+static inline void proxy_reacquire_rq_lock(struct rq *rq, struct rq_flags *rf)
+__acquires(__rq_lockp(rq))
+{
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
+}
+
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int target_cpu)
+ __must_hold(__rq_lockp(rq))
+{
+ struct rq *target_rq = cpu_rq(target_cpu);
+
+ lockdep_assert_rq_held(rq);
+ WARN_ON(p == rq->curr);
+ /*
+ * Since we are migrating a blocked donor, it could be rq->donor,
+ * and we want to make sure there aren't any references from this
+ * rq to it before we drop the lock. This avoids another cpu
+ * jumping in and grabbing the rq lock and referencing rq->donor
+ * or cfs_rq->curr, etc after we have migrated it to another cpu,
+ * and before we pick_again in __schedule.
+ *
+ * So call proxy_resched_idle() to drop the rq->donor references
+ * before we release the lock.
+ */
+ proxy_resched_idle(rq);
+
+ deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+ proxy_set_task_cpu(p, target_cpu);
+
+ proxy_release_rq_lock(rq, rf);
+
+ attach_one_task(target_rq, p);
+
+ proxy_reacquire_rq_lock(rq, rf);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p)
+ __must_hold(__rq_lockp(rq))
{
- if (!__proxy_deactivate(rq, donor)) {
+ struct rq *task_rq, *target_rq = NULL;
+ int cpu, wake_flag = WF_TTWU;
+
+ lockdep_assert_rq_held(rq);
+ WARN_ON(p == rq->curr);
+
+ if (p == rq->donor)
+ proxy_resched_idle(rq);
+
+ proxy_release_rq_lock(rq, rf);
+ /*
+ * We drop the rq lock, and re-grab task_rq_lock to get
+ * the pi_lock (needed for select_task_rq) as well.
+ */
+ scoped_guard (task_rq_lock, p) {
+ task_rq = scope.rq;
+
/*
- * XXX: For now, if deactivation failed, set donor
- * as unblocked, as we aren't doing proxy-migrations
- * yet (more logic will be needed then).
+ * Since we let go of the rq lock, the task may have been
+ * woken or migrated to another rq before we got the
+ * task_rq_lock. So re-check we're on the same RQ. If
+ * not, the task has already been migrated and that CPU
+ * will handle any further migrations.
*/
- clear_task_blocked_on(donor, NULL);
+ if (task_rq != rq)
+ break;
+
+ /*
+ * Similarly, if we've been dequeued, someone else will
+ * wake us
+ */
+ if (!task_on_rq_queued(p))
+ break;
+
+ /*
+ * Since we should only be calling here from __schedule()
+ * -> find_proxy_task(), no one else should have
+ * assigned current out from under us. But check and warn
+ * if we see this, then bail.
+ */
+ if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
+ WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
+ __func__, cpu_of(task_rq),
+ p->comm, p->pid, p->on_cpu);
+ break;
+ }
+
+ update_rq_clock(task_rq);
+ deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+ set_task_cpu(p, cpu);
+ target_rq = cpu_rq(cpu);
+ clear_task_blocked_on(p, NULL);
}
- return NULL;
+
+ if (target_rq)
+ attach_one_task(target_rq, p);
+
+ proxy_reacquire_rq_lock(rq, rf);
}
/*
@@ -6629,18 +6764,27 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
*/
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+ __must_hold(__rq_lockp(rq))
{
- enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
+ enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
struct task_struct *owner = NULL;
+ bool curr_in_chain = false;
int this_cpu = cpu_of(rq);
struct task_struct *p;
struct mutex *mutex;
+ int owner_cpu;
/* Follow blocked_on chain. */
for (p = donor; (mutex = p->blocked_on); p = owner) {
- /* if its PROXY_WAKING, resched_idle so ttwu can complete */
- if (mutex == PROXY_WAKING)
- return proxy_resched_idle(rq);
+ /* if it's PROXY_WAKING, do return migration or run if current */
+ if (mutex == PROXY_WAKING) {
+ if (task_current(rq, p)) {
+ clear_task_blocked_on(p, PROXY_WAKING);
+ return p;
+ }
+ action = NEEDS_RETURN;
+ break;
+ }
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
@@ -6660,26 +6804,41 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
return NULL;
}
+ if (task_current(rq, p))
+ curr_in_chain = true;
+
owner = __mutex_owner(mutex);
if (!owner) {
/*
- * If there is no owner, clear blocked_on
- * and return p so it can run and try to
- * acquire the lock
+ * If there is no owner, either clear blocked_on
+ * and return p (if it is current and safe to
+ * just run on this rq), or return-migrate the task.
*/
- __clear_task_blocked_on(p, mutex);
- return p;
+ if (task_current(rq, p)) {
+ __clear_task_blocked_on(p, NULL);
+ return p;
+ }
+ action = NEEDS_RETURN;
+ break;
}
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
+ if (curr_in_chain)
+ return proxy_resched_idle(rq);
action = DEACTIVATE_DONOR;
break;
}
- if (task_cpu(owner) != this_cpu) {
- /* XXX Don't handle migrations yet */
- action = DEACTIVATE_DONOR;
+ owner_cpu = task_cpu(owner);
+ if (owner_cpu != this_cpu) {
+ /*
+ * @owner can disappear, simply migrate to @owner_cpu
+ * and leave that CPU to sort things out.
+ */
+ if (curr_in_chain)
+ return proxy_resched_idle(rq);
+ action = MIGRATE;
break;
}
@@ -6741,7 +6900,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
/* Handle actions we need to do outside of the guard() scope */
switch (action) {
case DEACTIVATE_DONOR:
- return proxy_deactivate(rq, donor);
+ if (proxy_deactivate(rq, donor))
+ return NULL;
+ /* If deactivate fails, force return */
+ p = donor;
+ fallthrough;
+ case NEEDS_RETURN:
+ proxy_force_return(rq, rf, p);
+ return NULL;
+ case MIGRATE:
+ proxy_migrate_task(rq, rf, p, owner_cpu);
+ return NULL;
case FOUND:
/* fallthrough */;
}
--
2.53.0.1018.g2bb0e51243-goog
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration)
2026-03-24 19:13 ` [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2026-03-26 22:52 ` Steven Rostedt
2026-03-27 4:47 ` K Prateek Nayak
2026-04-02 14:43 ` Peter Zijlstra
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2026-03-26 22:52 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
K Prateek Nayak, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Tue, 24 Mar 2026 19:13:25 +0000
John Stultz <jstultz@google.com> wrote:
> kernel/sched/core.c | 225 ++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 197 insertions(+), 28 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4ed24ef590f73..49e4528450083 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3643,6 +3643,23 @@ void update_rq_avg_idle(struct rq *rq)
> rq->idle_stamp = 0;
> }
>
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
> +{
> + unsigned int wake_cpu;
> +
> + /*
> + * Since we are enqueuing a blocked task on a cpu it may
> + * not be able to run on, preserve wake_cpu when we
> + * __set_task_cpu so we can return the task to where it
> + * was previously runnable.
> + */
> + wake_cpu = p->wake_cpu;
> + __set_task_cpu(p, cpu);
> + p->wake_cpu = wake_cpu;
> +}
> +#endif /* CONFIG_SCHED_PROXY_EXEC */
Hmm, this is only used in proxy_migrate_task() which is also within a
#ifdef CONFIG_SCHED_PROXY_EXEC block. Why did you put this function here
and create yet another #ifdef block with the same conditional?
Couldn't you just add it just before where it is used?
[..]
> +/*
> + * If the blocked-on relationship crosses CPUs, migrate @p to the
> + * owner's CPU.
> + *
> + * This is because we must respect the CPU affinity of execution
> + * contexts (owner) but we can ignore affinity for scheduling
> + * contexts (@p). So we have to move scheduling contexts towards
> + * potential execution contexts.
> + *
> + * Note: The owner can disappear, but simply migrate to @target_cpu
> + * and leave that CPU to sort things out.
> + */
> +static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
> + struct task_struct *p, int target_cpu)
> + __must_hold(__rq_lockp(rq))
> +{
> + struct rq *target_rq = cpu_rq(target_cpu);
> +
> + lockdep_assert_rq_held(rq);
> + WARN_ON(p == rq->curr);
> + /*
> + * Since we are migrating a blocked donor, it could be rq->donor,
> + * and we want to make sure there aren't any references from this
> + * rq to it before we drop the lock. This avoids another cpu
> + * jumping in and grabbing the rq lock and referencing rq->donor
> + * or cfs_rq->curr, etc after we have migrated it to another cpu,
> + * and before we pick_again in __schedule.
> + *
> + * So call proxy_resched_idle() to drop the rq->donor references
> + * before we release the lock.
> + */
> + proxy_resched_idle(rq);
> +
> + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> + proxy_set_task_cpu(p, target_cpu);
> +
> + proxy_release_rq_lock(rq, rf);
> +
> + attach_one_task(target_rq, p);
> +
> + proxy_reacquire_rq_lock(rq, rf);
> +}
-- Steve
* Re: [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration)
2026-03-26 22:52 ` Steven Rostedt
@ 2026-03-27 4:47 ` K Prateek Nayak
2026-03-27 12:47 ` Peter Zijlstra
0 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-27 4:47 UTC (permalink / raw)
To: Steven Rostedt, John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Ben Segall, Zimuzo Ezeozue, Mel Gorman, Will Deacon, Waiman Long,
Boqun Feng, Paul E. McKenney, Metin Kaya, Xuewen Yan,
Thomas Gleixner, Daniel Lezcano, Suleiman Souhlal, kuyo chang,
hupu, kernel-team
Hello Steve,
On 3/27/2026 4:22 AM, Steven Rostedt wrote:
>> +#ifdef CONFIG_SCHED_PROXY_EXEC
>> +static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
>> +{
>> + unsigned int wake_cpu;
>> +
>> + /*
>> + * Since we are enqueuing a blocked task on a cpu it may
>> + * not be able to run on, preserve wake_cpu when we
>> + * __set_task_cpu so we can return the task to where it
>> + * was previously runnable.
>> + */
>> + wake_cpu = p->wake_cpu;
>> + __set_task_cpu(p, cpu);
>> + p->wake_cpu = wake_cpu;
>> +}
>> +#endif /* CONFIG_SCHED_PROXY_EXEC */
>
> Hmm, this is only used in proxy_migrate_task() which is also within a
> #ifdef CONFIG_SCHED_PROXY_EXEC block. Why did you put this function here
> and create yet another #ifdef block with the same conditional?
>
> Couldn't you just add it just before where it is used?
If I have to take a guess, it is here because the full proxy stack makes
use of this for blocked owner bits and this was likely broken off from
that when it was one huge patch.
When activating the entire blocked_on chain when sleeping owner wakes
up, this is used for migrating the donors to the owner's CPU
(activate_blocked_waiters() on
johnstultz-work/linux-dev/proxy-exec-v26-7.0-rc5)
That needs to come before sched_ttwu_pending() which is just a few
functions down.
Maybe we can keep all the bits together for now and just use a
forward declaration later when those bits come. Thoughts?
--
Thanks and Regards,
Prateek
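[Editorial aside: the forward-declaration pattern Prateek suggests can be sketched in plain C. All names below are hypothetical stubs, not the kernel functions — the point is simply that one #ifdef block can hold the definition while an earlier caller needs only the declaration.]

```c
#include <assert.h>

#define CONFIG_SCHED_PROXY_EXEC 1

#ifdef CONFIG_SCHED_PROXY_EXEC
/* Forward declaration: lets a caller that appears earlier in the file
 * use the helper without opening a second #ifdef block for it. */
static int proxy_set_task_cpu_stub(int cpu);

/* Stands in for a caller (e.g. something like sched_ttwu_pending())
 * that sits a few functions above the helper's definition. */
static int early_caller_stub(void)
{
	return proxy_set_task_cpu_stub(3);
}

/* Single definition, later in the same conditional block. */
static int proxy_set_task_cpu_stub(int cpu)
{
	return cpu;
}
#endif
```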
* Re: [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration)
2026-03-27 4:47 ` K Prateek Nayak
@ 2026-03-27 12:47 ` Peter Zijlstra
0 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-27 12:47 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Steven Rostedt, John Stultz, LKML, Joel Fernandes, Qais Yousef,
Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Valentin Schneider, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Mar 27, 2026 at 10:17:49AM +0530, K Prateek Nayak wrote:
> Maybe we can keep all the bits together for now and just use a
> forward declaration later when those bits comes. Thoughts?
Yeah, I moved it for now. We can ponder later, later ;-)
* Re: [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration)
2026-03-24 19:13 ` [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration) John Stultz
2026-03-26 22:52 ` Steven Rostedt
@ 2026-04-02 14:43 ` Peter Zijlstra
2026-04-02 15:08 ` Peter Zijlstra
2026-04-02 17:34 ` John Stultz
2026-04-03 12:30 ` [tip: sched/core] " tip-bot2 for John Stultz
2 siblings, 2 replies; 56+ messages in thread
From: Peter Zijlstra @ 2026-04-02 14:43 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
So with that other issue cured, I'm back to staring at this thing....
On Tue, Mar 24, 2026 at 07:13:25PM +0000, John Stultz wrote:
> +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> {
> unsigned long state = READ_ONCE(donor->__state);
>
> @@ -6598,17 +6610,140 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
> return try_to_block_task(rq, donor, &state, true);
> }
>
> @@ -6741,7 +6900,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> /* Handle actions we need to do outside of the guard() scope */
> switch (action) {
> case DEACTIVATE_DONOR:
> - return proxy_deactivate(rq, donor);
> + if (proxy_deactivate(rq, donor))
> + return NULL;
> + /* If deactivate fails, force return */
> + p = donor;
> + fallthrough;
I was going to reply to Prateek's email and was going over the whole
ttwu path because of that, and that got me looking at this.
What happens here if donor is migrated; the current CPU no longer valid
and we fail proxy_deactivate() because of a pending signal?
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration)
2026-04-02 14:43 ` Peter Zijlstra
@ 2026-04-02 15:08 ` Peter Zijlstra
2026-04-02 17:43 ` John Stultz
2026-04-02 17:34 ` John Stultz
1 sibling, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-04-02 15:08 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Thu, Apr 02, 2026 at 04:43:02PM +0200, Peter Zijlstra wrote:
>
> So with that other issue cured, I'm back to staring at this thing....
>
> On Tue, Mar 24, 2026 at 07:13:25PM +0000, John Stultz wrote:
> > +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> > {
> > unsigned long state = READ_ONCE(donor->__state);
> >
> > @@ -6598,17 +6610,140 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
> > return try_to_block_task(rq, donor, &state, true);
> > }
> >
>
> > @@ -6741,7 +6900,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > /* Handle actions we need to do outside of the guard() scope */
> > switch (action) {
> > case DEACTIVATE_DONOR:
> > - return proxy_deactivate(rq, donor);
> > + if (proxy_deactivate(rq, donor))
> > + return NULL;
> > + /* If deactivate fails, force return */
> > + p = donor;
> > + fallthrough;
>
> I was going to reply to Prateek's email and was going over the whole
> ttwu path because of that, and that got me looking at this.
>
> What happens here if donor is migrated; the current CPU no longer valid
> and we fail proxy_deactivate() because of a pending signal?
Something like so?
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, stru
dequeue_task(rq, p, flags);
}
-static void block_task(struct rq *rq, struct task_struct *p, int flags)
+static void _block_task(struct rq *rq, struct task_struct *p, int flags)
{
if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
__block_task(rq, p);
@@ -6503,6 +6503,31 @@ pick_next_task(struct rq *rq, struct tas
#define SM_PREEMPT 1
#define SM_RTLOCK_WAIT 2
+static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
+{
+ int flags = DEQUEUE_NOCLOCK;
+
+ p->sched_contributes_to_load =
+ (task_state & TASK_UNINTERRUPTIBLE) &&
+ !(task_state & TASK_NOLOAD) &&
+ !(task_state & TASK_FROZEN);
+
+ if (unlikely(is_special_task_state(task_state)))
+ flags |= DEQUEUE_SPECIAL;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (p->on_rq && ...)
+ * if (prev_state) goto out;
+ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
+ * p->state = TASK_WAKING
+ *
+ * Where __schedule() and ttwu() have matching control dependencies.
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
+ _block_task(rq, p, flags);
+}
/*
* Helper function for __schedule()
*
@@ -6515,7 +6540,6 @@ static bool try_to_block_task(struct rq
unsigned long *task_state_p, bool should_block)
{
unsigned long task_state = *task_state_p;
- int flags = DEQUEUE_NOCLOCK;
if (signal_pending_state(task_state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
@@ -6535,26 +6559,7 @@ static bool try_to_block_task(struct rq
if (!should_block)
return false;
- p->sched_contributes_to_load =
- (task_state & TASK_UNINTERRUPTIBLE) &&
- !(task_state & TASK_NOLOAD) &&
- !(task_state & TASK_FROZEN);
-
- if (unlikely(is_special_task_state(task_state)))
- flags |= DEQUEUE_SPECIAL;
-
- /*
- * __schedule() ttwu()
- * prev_state = prev->state; if (p->on_rq && ...)
- * if (prev_state) goto out;
- * p->on_rq = 0; smp_acquire__after_ctrl_dep();
- * p->state = TASK_WAKING
- *
- * Where __schedule() and ttwu() have matching control dependencies.
- *
- * After this, schedule() must not care about p->state any more.
- */
- block_task(rq, p, flags);
+ block_task(rq, p, task_state);
return true;
}
@@ -6599,7 +6604,8 @@ static bool proxy_deactivate(struct rq *
* need to be changed from next *before* we deactivate.
*/
proxy_resched_idle(rq);
- return try_to_block_task(rq, donor, &state, true);
+ block_task(rq, donor, state);
+ return true;
}
static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration)
2026-04-02 15:08 ` Peter Zijlstra
@ 2026-04-02 17:43 ` John Stultz
0 siblings, 0 replies; 56+ messages in thread
From: John Stultz @ 2026-04-02 17:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Thu, Apr 2, 2026 at 8:08 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Apr 02, 2026 at 04:43:02PM +0200, Peter Zijlstra wrote:
> >
> > So with that other issue cured, I'm back to staring at this thing....
> >
> > On Tue, Mar 24, 2026 at 07:13:25PM +0000, John Stultz wrote:
> > > +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> > > {
> > > unsigned long state = READ_ONCE(donor->__state);
> > >
> > > @@ -6598,17 +6610,140 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
> > > return try_to_block_task(rq, donor, &state, true);
> > > }
> > >
> >
> > > @@ -6741,7 +6900,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > > /* Handle actions we need to do outside of the guard() scope */
> > > switch (action) {
> > > case DEACTIVATE_DONOR:
> > > - return proxy_deactivate(rq, donor);
> > > + if (proxy_deactivate(rq, donor))
> > > + return NULL;
> > > + /* If deactivate fails, force return */
> > > + p = donor;
> > > + fallthrough;
> >
> > I was going to reply to Prateek's email and was going over the whole
> > ttwu path because of that, and that got me looking at this.
> >
> > What happens here if donor is migrated; the current CPU no longer valid
> > and we fail proxy_deactivate() because of a pending signal?
>
> Something like so?
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6515,7 +6540,6 @@ static bool try_to_block_task(struct rq
> unsigned long *task_state_p, bool should_block)
> {
> unsigned long task_state = *task_state_p;
> - int flags = DEQUEUE_NOCLOCK;
>
> if (signal_pending_state(task_state, p)) {
> WRITE_ONCE(p->__state, TASK_RUNNING);
...
> @@ -6599,7 +6604,8 @@ static bool proxy_deactivate(struct rq *
> * need to be changed from next *before* we deactivate.
> */
> proxy_resched_idle(rq);
> - return try_to_block_task(rq, donor, &state, true);
> + block_task(rq, donor, state);
> + return true;
> }
>
> static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
So, it seems like this would just cause us to ignore the pending
signal (and just do the deactivation)? I'm not sure why that's
desired...
thanks
-john
* Re: [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration)
2026-04-02 14:43 ` Peter Zijlstra
2026-04-02 15:08 ` Peter Zijlstra
@ 2026-04-02 17:34 ` John Stultz
1 sibling, 0 replies; 56+ messages in thread
From: John Stultz @ 2026-04-02 17:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, K Prateek Nayak, Thomas Gleixner,
Daniel Lezcano, Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Thu, Apr 2, 2026 at 7:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> So with that other issue cured, I'm back to staring at this thing....
>
> On Tue, Mar 24, 2026 at 07:13:25PM +0000, John Stultz wrote:
> > +static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> > {
> > unsigned long state = READ_ONCE(donor->__state);
> >
> > @@ -6598,17 +6610,140 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
> > return try_to_block_task(rq, donor, &state, true);
> > }
> >
>
> > @@ -6741,7 +6900,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
> > /* Handle actions we need to do outside of the guard() scope */
> > switch (action) {
> > case DEACTIVATE_DONOR:
> > - return proxy_deactivate(rq, donor);
> > + if (proxy_deactivate(rq, donor))
> > + return NULL;
> > + /* If deactivate fails, force return */
> > + p = donor;
> > + fallthrough;
>
> I was going to reply to Prateek's email and was going over the whole
> ttwu path because of that, and that got me looking at this.
>
> What happens here if donor is migrated; the current CPU no longer valid
> and we fail proxy_deactivate() because of a pending signal?
>
You hit the fallthrough line you quoted above, which puts you into:
force_return:
proxy_force_return(rq, rf, p);
return NULL;
Which will return the donor to a CPU it can run on.
Can you clarify more about the case you're worried about?
thanks
-john
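[Editorial aside: the control flow John describes — a failed deactivation falling through into the force-return path — can be modeled as a small stand-alone sketch. The stub names below are hypothetical, not the kernel functions; the sketch uses the enum form of the switch from the v26 patch.]

```c
#include <assert.h>
#include <stdbool.h>

enum action { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN };

static int force_returns;	/* counts calls to the force-return stub */

/* Stands in for proxy_deactivate(): fails when a signal is pending. */
static bool deactivate_stub(bool signal_pending)
{
	return !signal_pending;
}

/* Stands in for proxy_force_return(). */
static void force_return_stub(void)
{
	force_returns++;
}

/* Models the switch at the end of find_proxy_task(). */
static void handle_action(enum action a, bool signal_pending)
{
	switch (a) {
	case DEACTIVATE_DONOR:
		if (deactivate_stub(signal_pending))
			return;	/* donor blocked; nothing more to do */
		/* deactivate failed: fall through and force return */
	case NEEDS_RETURN:
		force_return_stub();
		return;
	default:
		return;
	}
}
```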
* [tip: sched/core] sched: Handle blocked-waiter migration (and return migration)
2026-03-24 19:13 ` [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration) John Stultz
2026-03-26 22:52 ` Steven Rostedt
2026-04-02 14:43 ` Peter Zijlstra
@ 2026-04-03 12:30 ` tip-bot2 for John Stultz
2 siblings, 0 replies; 56+ messages in thread
From: tip-bot2 for John Stultz @ 2026-04-03 12:30 UTC (permalink / raw)
To: linux-tip-commits; +Cc: John Stultz, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: b049b81bdff6fc6794200a4c7d7d910e2008d57f
Gitweb: https://git.kernel.org/tip/b049b81bdff6fc6794200a4c7d7d910e2008d57f
Author: John Stultz <jstultz@google.com>
AuthorDate: Tue, 24 Mar 2026 19:13:25
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:41 +02:00
sched: Handle blocked-waiter migration (and return migration)
Add logic to handle migrating a blocked waiter to a remote
cpu where the lock owner is runnable.
Additionally, as the blocked task may not be able to run
on the remote cpu, add logic to handle return migration once
the waiting task is given the mutex.
Because tasks may get migrated to where they cannot run, also
modify the scheduling classes to avoid sched class migrations on
mutex blocked tasks, leaving find_proxy_task() and related logic
to do the migrations and return migrations.
This was split out from the larger proxy patch, and
significantly reworked.
Credits for the original patch go to:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260324191337.1841376-11-jstultz@google.com
---
kernel/sched/core.c | 232 +++++++++++++++++++++++++++++++++++--------
1 file changed, 194 insertions(+), 38 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 162b24c..c15c986 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4239,13 +4239,6 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
ttwu_queue(p, cpu, wake_flags);
}
out:
- /*
- * For now, if we've been woken up, clear the task->blocked_on
- * regardless if it was set to a mutex or PROXY_WAKING so the
- * task can run. We will need to be more careful later when
- * properly handling proxy migration
- */
- clear_task_blocked_on(p, NULL);
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
@@ -6530,6 +6523,8 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
if (signal_pending_state(task_state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
*task_state_p = TASK_RUNNING;
+ set_task_blocked_on_waking(p, NULL);
+
return false;
}
@@ -6567,6 +6562,21 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
}
#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline void proxy_set_task_cpu(struct task_struct *p, int cpu)
+{
+ unsigned int wake_cpu;
+
+ /*
+ * Since we are enqueuing a blocked task on a cpu it may
+ * not be able to run on, preserve wake_cpu when we
+ * __set_task_cpu so we can return the task to where it
+ * was previously runnable.
+ */
+ wake_cpu = p->wake_cpu;
+ __set_task_cpu(p, cpu);
+ p->wake_cpu = wake_cpu;
+}
+
static inline struct task_struct *proxy_resched_idle(struct rq *rq)
{
put_prev_set_next_task(rq, rq->donor, rq->idle);
@@ -6575,7 +6585,7 @@ static inline struct task_struct *proxy_resched_idle(struct rq *rq)
return rq->idle;
}
-static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
{
unsigned long state = READ_ONCE(donor->__state);
@@ -6595,17 +6605,140 @@ static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
return try_to_block_task(rq, donor, &state, true);
}
-static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
+ __releases(__rq_lockp(rq))
+{
+ /*
+ * The class scheduler may have queued a balance callback
+ * from pick_next_task() called earlier.
+ *
+ * So here we have to zap callbacks before unlocking the rq
+ * as another CPU may jump in and call sched_balance_rq
+ * which can trip the warning in rq_pin_lock() if we
+ * leave callbacks set.
+ *
+ * After we later reaquire the rq lock, we will force __schedule()
+ * to pick_again, so the callbacks will get re-established.
+ */
+ zap_balance_callbacks(rq);
+ rq_unpin_lock(rq, rf);
+ raw_spin_rq_unlock(rq);
+}
+
+static inline void proxy_reacquire_rq_lock(struct rq *rq, struct rq_flags *rf)
+ __acquires(__rq_lockp(rq))
+{
+ raw_spin_rq_lock(rq);
+ rq_repin_lock(rq, rf);
+ update_rq_clock(rq);
+}
+
+/*
+ * If the blocked-on relationship crosses CPUs, migrate @p to the
+ * owner's CPU.
+ *
+ * This is because we must respect the CPU affinity of execution
+ * contexts (owner) but we can ignore affinity for scheduling
+ * contexts (@p). So we have to move scheduling contexts towards
+ * potential execution contexts.
+ *
+ * Note: The owner can disappear, but simply migrate to @target_cpu
+ * and leave that CPU to sort things out.
+ */
+static void proxy_migrate_task(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p, int target_cpu)
+ __must_hold(__rq_lockp(rq))
+{
+ struct rq *target_rq = cpu_rq(target_cpu);
+
+ lockdep_assert_rq_held(rq);
+ WARN_ON(p == rq->curr);
+ /*
+ * Since we are migrating a blocked donor, it could be rq->donor,
+ * and we want to make sure there aren't any references from this
+ * rq to it before we drop the lock. This avoids another cpu
+ * jumping in and grabbing the rq lock and referencing rq->donor
+ * or cfs_rq->curr, etc after we have migrated it to another cpu,
+ * and before we pick_again in __schedule.
+ *
+ * So call proxy_resched_idle() to drop the rq->donor references
+ * before we release the lock.
+ */
+ proxy_resched_idle(rq);
+
+ deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+ proxy_set_task_cpu(p, target_cpu);
+
+ proxy_release_rq_lock(rq, rf);
+
+ attach_one_task(target_rq, p);
+
+ proxy_reacquire_rq_lock(rq, rf);
+}
+
+static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
+ struct task_struct *p)
+ __must_hold(__rq_lockp(rq))
{
- if (!__proxy_deactivate(rq, donor)) {
+ struct rq *task_rq, *target_rq = NULL;
+ int cpu, wake_flag = WF_TTWU;
+
+ lockdep_assert_rq_held(rq);
+ WARN_ON(p == rq->curr);
+
+ if (p == rq->donor)
+ proxy_resched_idle(rq);
+
+ proxy_release_rq_lock(rq, rf);
+ /*
+ * We drop the rq lock, and re-grab task_rq_lock to get
+ * the pi_lock (needed for select_task_rq) as well.
+ */
+ scoped_guard (task_rq_lock, p) {
+ task_rq = scope.rq;
+
/*
- * XXX: For now, if deactivation failed, set donor
- * as unblocked, as we aren't doing proxy-migrations
- * yet (more logic will be needed then).
+ * Since we let go of the rq lock, the task may have been
+ * woken or migrated to another rq before we got the
+ * task_rq_lock. So re-check we're on the same RQ. If
+ * not, the task has already been migrated and that CPU
+ * will handle any futher migrations.
*/
- clear_task_blocked_on(donor, NULL);
+ if (task_rq != rq)
+ break;
+
+ /*
+ * Similarly, if we've been dequeued, someone else will
+ * wake us
+ */
+ if (!task_on_rq_queued(p))
+ break;
+
+ /*
+ * Since we should only be calling here from __schedule()
+ * -> find_proxy_task(), no one else should have
+ * assigned current out from under us. But check and warn
+ * if we see this, then bail.
+ */
+ if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
+ WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
+ __func__, cpu_of(task_rq),
+ p->comm, p->pid, p->on_cpu);
+ break;
+ }
+
+ update_rq_clock(task_rq);
+ deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
+ cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
+ set_task_cpu(p, cpu);
+ target_rq = cpu_rq(cpu);
+ clear_task_blocked_on(p, NULL);
}
- return NULL;
+
+ if (target_rq)
+ attach_one_task(target_rq, p);
+
+ proxy_reacquire_rq_lock(rq, rf);
}
/*
@@ -6626,18 +6759,25 @@ static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *d
*/
static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+ __must_hold(__rq_lockp(rq))
{
- enum { FOUND, DEACTIVATE_DONOR } action = FOUND;
struct task_struct *owner = NULL;
+ bool curr_in_chain = false;
int this_cpu = cpu_of(rq);
struct task_struct *p;
struct mutex *mutex;
+ int owner_cpu;
/* Follow blocked_on chain. */
for (p = donor; (mutex = p->blocked_on); p = owner) {
- /* if its PROXY_WAKING, resched_idle so ttwu can complete */
- if (mutex == PROXY_WAKING)
- return proxy_resched_idle(rq);
+ /* if its PROXY_WAKING, do return migration or run if current */
+ if (mutex == PROXY_WAKING) {
+ if (task_current(rq, p)) {
+ clear_task_blocked_on(p, PROXY_WAKING);
+ return p;
+ }
+ goto force_return;
+ }
/*
* By taking mutex->wait_lock we hold off concurrent mutex_unlock()
@@ -6657,27 +6797,39 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
return NULL;
}
+ if (task_current(rq, p))
+ curr_in_chain = true;
+
owner = __mutex_owner(mutex);
if (!owner) {
/*
- * If there is no owner, clear blocked_on
- * and return p so it can run and try to
- * acquire the lock
+ * If there is no owner, either clear blocked_on
+ * and return p (if it is current and safe to
+ * just run on this rq), or return-migrate the task.
*/
- __clear_task_blocked_on(p, mutex);
- return p;
+ if (task_current(rq, p)) {
+ __clear_task_blocked_on(p, NULL);
+ return p;
+ }
+ goto force_return;
}
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
- action = DEACTIVATE_DONOR;
- break;
+ if (curr_in_chain)
+ return proxy_resched_idle(rq);
+ goto deactivate;
}
- if (task_cpu(owner) != this_cpu) {
- /* XXX Don't handle migrations yet */
- action = DEACTIVATE_DONOR;
- break;
+ owner_cpu = task_cpu(owner);
+ if (owner_cpu != this_cpu) {
+ /*
+ * @owner can disappear, simply migrate to @owner_cpu
+ * and leave that CPU to sort things out.
+ */
+ if (curr_in_chain)
+ return proxy_resched_idle(rq);
+ goto migrate_task;
}
if (task_on_rq_migrating(owner)) {
@@ -6734,16 +6886,20 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
* guarantee its existence, as per ttwu_remote().
*/
}
-
- /* Handle actions we need to do outside of the guard() scope */
- switch (action) {
- case DEACTIVATE_DONOR:
- return proxy_deactivate(rq, donor);
- case FOUND:
- /* fallthrough */;
- }
WARN_ON_ONCE(owner && !owner->on_rq);
return owner;
+
+deactivate:
+ if (proxy_deactivate(rq, donor))
+ return NULL;
+ /* If deactivate fails, force return */
+ p = donor;
+force_return:
+ proxy_force_return(rq, rf, p);
+ return NULL;
+migrate_task:
+ proxy_migrate_task(rq, rf, p, owner_cpu);
+ return NULL;
}
#else /* SCHED_PROXY_EXEC */
static struct task_struct *
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-24 19:13 [PATCH v26 00/10] Simple Donor Migration for Proxy Execution John Stultz
` (9 preceding siblings ...)
2026-03-24 19:13 ` [PATCH v26 10/10] sched: Handle blocked-waiter migration (and return migration) John Stultz
@ 2026-03-25 10:52 ` K Prateek Nayak
2026-03-27 11:48 ` Peter Zijlstra
2026-03-27 19:10 ` John Stultz
10 siblings, 2 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-25 10:52 UTC (permalink / raw)
To: John Stultz, LKML
Cc: Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 3/25/2026 12:43 AM, John Stultz wrote:
> I really want to share my appreciation for feedback provided by
> Peter, K Prateek and Juri on the last revision!
And we really appreciate you working on this! (Cannot state this
enough)
> There’s also been some further improvements in the full Proxy
> Execution series:
> * Tweaks to proxy_needs_return() suggested by K Prateek
To answer your question on v25, I finally seem to have
ttwu_state_match() happy with the pieces in:
https://github.com/kudureranganath/linux/commits/kudure/sched/proxy/ttwu_state_match/
The base rationale is still the same from
https://lore.kernel.org/lkml/eccf9bb5-8455-48e5-aa35-4878c25f6822@amd.com/
tl;dr
Use rq_lock() to serialize clearing of "p->blocked_on". All the other
transitions change "p->blocked_on" between non-NULL values. Exploit this to use
ttwu_state_match() + ttwu_runnable() in our favor when waking up blocked
donors and handling their return migration.
These are the commits of interest on the tree with a brief explanation:
I added back proxy_reset_donor() for the sake of testing now that a
bunch of other bits are addressed in v26. I've mostly been testing at
the below commit (this time with LOCKDEP enabled):
82a29c2ecd4b sched/core: Reset the donor to current task when donor is woken
...
5dc4507b1f04 [ANNOTATION] === proxy donor/blocked-waiter migration before this point ===
The above range, which has you as the author, has not been touched - same
as what you have on your proxy-exec-v26-7.0-rc5 branch.
I did not tackle sleeping owner bits yet because there are too many
locks, lists, and synchronization nuances that I still need to wrap my
head around. That said ...
The below makes ttwu_state_match() sufficient to handle
the return migration, which allows for using wake_up_process(). The
patches are small and the major ones should have enough rationale
in the comments and the commit message to justify the changes.
0b3810f43c66 sched/core: Simplify proxy_force_return()
609c41b77eaf sched/core: Remove proxy_task_runnable_but_waking()
157721338332 sched/core: Prepare proxy_deactivate() to comply with ttwu state machinery
abefa0729920 sched/core: Allow callers of try_to_block_task() to handle "blocked_on" relation
Only change to below was conflict resolution as a result of some
re-arrangement.
787b078b588f sched: Handle blocked-waiter migration (and return migration)
These are a few changes to proxy_needs_return() exploiting the idea
of "p->blocked_on" being only cleared under rq_lock:
84a2b581dfe8 sched/core: Remove "p->wake_cpu" constraint in proxy_needs_return()
c52d51d67452 sched/core: Handle "blocked_on" clearing for wakeups in ttwu_runnable()
I just moved this further up because I think it is an important bit for
handling the return migration vs the wakeup of a blocked donor. This too
has only been modified to resolve conflicts and nothing more.
9db85fb35c22 sched: Have try_to_wake_up() handle return-migration for PROXY_WAKING case
These are two small trivial fixes - one that already exists in your
tree and is required for using proxy_resched_idle() from
proxy_needs_return() and the other to clear "p->blocked_on" when a
wakeup races with __schedule():
0d6a01bb19db sched/core: Clear "blocked_on" relation if schedule races with wakeup
fd60c48f7b71 sched: Avoid donor->sched_class->yield_task() null traversal
Everything before this is same as what is in your tree.
The bottom ones have the most information and the commit messages
get briefer as we move to the top, but I believe there is enough context
in the comments + commit log to justify these changes. Some may
actually have too much context, but I've dumped out everything in my head.
I'll freeze this branch and use a WIP like yours if and when I
manage to crash and burn these bits.
I know you are already oversubscribed so please take a look on a
best effort basis. I can also resend this as a separate series
once v26 lands if there is enough interest.
Sorry for the dump and thank you again for patiently working on
this. Much appreciated _/\_
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-25 10:52 ` [PATCH v26 00/10] Simple Donor Migration for Proxy Execution K Prateek Nayak
@ 2026-03-27 11:48 ` Peter Zijlstra
2026-03-27 13:33 ` K Prateek Nayak
2026-03-27 19:15 ` John Stultz
2026-03-27 19:10 ` John Stultz
1 sibling, 2 replies; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-27 11:48 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Wed, Mar 25, 2026 at 04:22:14PM +0530, K Prateek Nayak wrote:
I tried to have a quick look, but I find it *very* hard to make sense of
the differences.
(could be I just don't know how to operate github -- that seems a
recurrent theme)
Anyway, this:
> fd60c48f7b71 sched: Avoid donor->sched_class->yield_task() null traversal
That seems *very* dodgy indeed. Exposing idle as the donor seems ... wrong?
Anyway, you seem to want to drive the return migration from the regular
wakeup path and I don't mind doing that, provided it isn't horrible. But
we can do this on top of these patches, right?
That is, I'm thinking of taking these patches, they're in reasonable
shape, and John deserves a little progress :-)
I did find myself liking the below a little better, but I'll just sneak
that in.
---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6822,7 +6822,6 @@ static struct task_struct *
find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
__must_hold(__rq_lockp(rq))
{
- enum { FOUND, DEACTIVATE_DONOR, MIGRATE, NEEDS_RETURN } action = FOUND;
struct task_struct *owner = NULL;
bool curr_in_chain = false;
int this_cpu = cpu_of(rq);
@@ -6838,8 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
clear_task_blocked_on(p, PROXY_WAKING);
return p;
}
- action = NEEDS_RETURN;
- break;
+ goto force_return;
}
/*
@@ -6874,16 +6872,14 @@ find_proxy_task(struct rq *rq, struct ta
__clear_task_blocked_on(p, NULL);
return p;
}
- action = NEEDS_RETURN;
- break;
+ goto force_return;
}
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
/* XXX Don't handle blocked owners/delayed dequeue yet */
if (curr_in_chain)
return proxy_resched_idle(rq);
- action = DEACTIVATE_DONOR;
- break;
+ goto deactivate;
}
owner_cpu = task_cpu(owner);
@@ -6894,8 +6890,7 @@ find_proxy_task(struct rq *rq, struct ta
*/
if (curr_in_chain)
return proxy_resched_idle(rq);
- action = MIGRATE;
- break;
+ goto migrate_task;
}
if (task_on_rq_migrating(owner)) {
@@ -6952,26 +6947,20 @@ find_proxy_task(struct rq *rq, struct ta
* guarantee its existence, as per ttwu_remote().
*/
}
-
- /* Handle actions we need to do outside of the guard() scope */
- switch (action) {
- case DEACTIVATE_DONOR:
- if (proxy_deactivate(rq, donor))
- return NULL;
- /* If deactivate fails, force return */
- p = donor;
- fallthrough;
- case NEEDS_RETURN:
- proxy_force_return(rq, rf, p);
- return NULL;
- case MIGRATE:
- proxy_migrate_task(rq, rf, p, owner_cpu);
- return NULL;
- case FOUND:
- /* fallthrough */;
- }
WARN_ON_ONCE(owner && !owner->on_rq);
return owner;
+
+deactivate:
+ if (proxy_deactivate(rq, donor))
+ return NULL;
+ /* If deactivate fails, force return */
+ p = donor;
+force_return:
+ proxy_force_return(rq, rf, p);
+ return NULL;
+migrate_task:
+ proxy_migrate_task(rq, rf, p, owner_cpu);
+ return NULL;
}
#else /* SCHED_PROXY_EXEC */
static struct task_struct *
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-27 11:48 ` Peter Zijlstra
@ 2026-03-27 13:33 ` K Prateek Nayak
2026-03-27 15:20 ` Peter Zijlstra
2026-03-27 16:00 ` Peter Zijlstra
2026-03-27 19:15 ` John Stultz
1 sibling, 2 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-27 13:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello Peter,
On 3/27/2026 5:18 PM, Peter Zijlstra wrote:
> I tried to have a quick look, but I find it *very* hard to make sense of
> the differences.
A couple of concerns I had with the current approach:
1. Why can't we simply do block_task() + wake_up_process() for return
migration?
2. Why does proxy_needs_return() (this comes later in John's tree but I
moved it up ahead) need the proxy_task_runnable_but_waking() override
of the ttwu_state_match() machinery?
(https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)
3. How can proxy_deactivate() see a TASK_RUNNING for blocked donor?
So I went back after my discussion with John at LPC to see if the
ttwu_state_match() stuff can be left alone and I sent out that
incomprehensible diff on v24.
Then I put a tree where my mouth is to give better rationale behind
each small hunk that was mostly in my head until then. Voila, another
(slightly less) incomprehensible set of bite sized changes :-)
>
> Anyway, this:
>
>> fd60c48f7b71 sched: Avoid donor->sched_class->yield_task() null traversal
>
> That seems *very* dodgy indeed. Exposing idle as the donor seems ... wrong?
That should get fixed by
https://github.com/kudureranganath/linux/commit/82a29c2ecd4b5f8eb082bb6a4a647aa16a2850be
John has mentioned hitting some warnings a while back from that
https://lore.kernel.org/lkml/f5bc87a7-390f-4e68-95b0-10cab2b92caf@amd.com/
Since v26 does proxy_resched_idle() before doing
proxy_release_rq_lock() in proxy_force_return(), that shouldn't be a
problem.
Speaking of that commit, I would like you or Juri to confirm if it is
okay to set a throttled deadline task as rq->donor for a while until it
hits resched.
>
>
> Anyway, you seem to want to drive the return migration from the regular
> wakeup path and I don't mind doing that, provided it isn't horrible. But
> we can do this on top of these patches, right?
>
> That is, I'm thinking of taking these patches, they're in reasonable
> shape, and John deserves a little progress :-)
>
> I did find myself liking the below a little better, but I'll just sneak
> that in.
So John originally had that and then I saw Dan's comment in
cleanup.h that reads:
Lastly, given that the benefit of cleanup helpers is removal of
"goto", and that the "goto" statement can jump between scopes, the
expectation is that usage of "goto" and cleanup helpers is never
mixed in the same function. I.e. for a given routine, convert all
resources that need a "goto" cleanup to scope-based cleanup, or
convert none of them.
which can either be interpreted as "Don't do it unless you know what
you are doing" or "There is at least one compiler that will get a
goto + cleanup guard wrong" and to err on the side of caution, I
suggested we do break + enums.
If there are no concerns, then the suggested diff is indeed much
better.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-27 13:33 ` K Prateek Nayak
@ 2026-03-27 15:20 ` Peter Zijlstra
2026-03-27 15:41 ` Peter Zijlstra
2026-03-27 16:00 ` Peter Zijlstra
1 sibling, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-27 15:20 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Mar 27, 2026 at 07:03:19PM +0530, K Prateek Nayak wrote:
> So John originally had that and then I saw Dan's comment in
> cleanup.h that reads:
>
> Lastly, given that the benefit of cleanup helpers is removal of
> "goto", and that the "goto" statement can jump between scopes, the
> expectation is that usage of "goto" and cleanup helpers is never
> mixed in the same function. I.e. for a given routine, convert all
> resources that need a "goto" cleanup to scope-based cleanup, or
> convert none of them.
>
> which can either be interpreted as "Don't do it unless you know what
> you are doing" or "There is at least one compiler that will get a
> goto + cleanup guard wrong" and to err on the side of caution, I
> suggested we do break + enums.
>
> If there are no concerns, then the suggested diff is indeed much
> better.
IIRC the concern was doing partial error handling conversions and
getting it hopelessly wrong.
And while some GCC's generate wrong code when you goto into a guard, all
clangs ever will error on that, so any such code should not survive the
robots.
And then there was an issue with computed gotos and asm_goto, but the
former are exceedingly rare (and again, clang will error IIRC) and the
latter we upped the minimum clang version for.
Anyway, there is nothing inherently wrong with using goto to exit a
scope and it works well.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-27 15:20 ` Peter Zijlstra
@ 2026-03-27 15:41 ` Peter Zijlstra
0 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-27 15:41 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Mar 27, 2026 at 04:20:52PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 27, 2026 at 07:03:19PM +0530, K Prateek Nayak wrote:
>
> > So John originally had that and then I saw Dan's comment in
> > cleanup.h that reads:
> >
> > Lastly, given that the benefit of cleanup helpers is removal of
> > "goto", and that the "goto" statement can jump between scopes, the
> > expectation is that usage of "goto" and cleanup helpers is never
> > mixed in the same function. I.e. for a given routine, convert all
> > resources that need a "goto" cleanup to scope-based cleanup, or
> > convert none of them.
> >
> > which can either be interpreted as "Don't do it unless you know what
> > you are doing" or "There is at least one compiler that will get a
> > goto + cleanup guard wrong" and to err on the side of caution, I
> > suggested we do break + enums.
> >
> > If there are no concerns, then the suggested diff is indeed much
> > better.
>
> IIRC the concern was doing partial error handling conversions and
> getting it hopelessly wrong.
>
> And while some GCC's generate wrong code when you goto into a guard, all
> clangs ever will error on that, so any such code should not survive the
> robots.
>
> And then there was an issue with computed gotos and asm_goto, but the
> former are exceedingly rare (and again, clang will error IIRC) and the
> latter we upped the minimum clang version for.
>
> Anyway, there is nothing inherently wrong with using goto to exit a
> scope and it works well.
That is, consider this:
void *foo(int bar)
{
int err;
something_1();
err = register_something(..);
if (err)
goto unregister;
void *obj __free(kfree) = kzalloc_obj(...);
....
return_ptr(obj);
unregister:
undo_something_1();
return ERR_PTR(err);
}
Looks okay, right? Except note how 'unregister' is inside the scope of
@obj.
(And this compiles 'fine' with various GCC)
This is the kind of errors that you get from partial error handling
conversion and is why Dan wrote what he did.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-27 13:33 ` K Prateek Nayak
2026-03-27 15:20 ` Peter Zijlstra
@ 2026-03-27 16:00 ` Peter Zijlstra
2026-03-27 16:57 ` K Prateek Nayak
1 sibling, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-03-27 16:00 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Mar 27, 2026 at 07:03:19PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 3/27/2026 5:18 PM, Peter Zijlstra wrote:
> > I tried to have a quick look, but I find it *very* hard to make sense of
> > the differences.
>
> A couple of concerns I had with the current approach:
>
> 1. Why can't we simply do block_task() + wake_up_process() for return
> migration?
So the way things are set up now, we have the blocked task 'on_rq', so
ttwu() will take ttwu_runnable() path, and we wake the task on the
'wrong' CPU.
At this point '->state == TASK_RUNNING' and schedule() will pick it and
... we hit '->blocked_on == PROXY_WAKING', which leads to
proxy_force_return(), which does deactivate_task()+activate_task() as
per a normal migration, and then all is well.
Right?
You're asking why proxy_force_return() doesn't use block_task()+ttwu()?
That seems really wrong at that point -- after all: '->state ==
TASK_RUNNING'.
Or; are you asking why we don't block_task() at the point where we set
'->blocked_on = PROXY_WAKING'? And then let ttwu() sort things out?
I suspect the latter is really hard to do vs lock ordering, but I've not
thought it through.
One thing you *can* do is frob ttwu_runnable() to 'refuse' to wake the
task, and then it goes into the normal path and will do the migration.
I've done things like that before.
Does that fix all the return-migration cases?
> 2. Why does proxy_needs_return() (this comes later in John's tree but I
> moved it up ahead) need the proxy_task_runnable_but_waking() override
> of the ttwu_state_match() machinery?
> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)
Since it comes later, I've not seen it and not given it thought ;-)
(I mean, I've probably seen it at some point, but being the gold-fish
that I am, I have no recollection, so I might as well not have seen it).
A brief look now makes me confused. The comment fails to describe how
that situation could ever come to pass.
> 3. How can proxy_deactivate() see a TASK_RUNNING for blocked donor?
I was looking at that.. I'm not sure. I mean, having the clause doesn't
hurt, but yeah, dunno.
> Speaking of that commit, I would like you or Juri to confirm if it is
> okay to set a throttled deadline task as rq->donor for a while until it
> hits resched.
I think that should be okay.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-27 16:00 ` Peter Zijlstra
@ 2026-03-27 16:57 ` K Prateek Nayak
2026-04-02 15:50 ` Peter Zijlstra
0 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-27 16:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello Peter,
On 3/27/2026 9:30 PM, Peter Zijlstra wrote:
> On Fri, Mar 27, 2026 at 07:03:19PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> On 3/27/2026 5:18 PM, Peter Zijlstra wrote:
>>> I tried to have a quick look, but I find it *very* hard to make sense of
>>> the differences.
>>
>> A couple of concerns I had with the current approach:
>>
>> 1. Why can't we simply do block_task() + wake_up_process() for return
>> migration?
>
> So the way things are set up now, we have the blocked task 'on_rq', so
> ttwu() will take ttwu_runnable() path, and we wake the task on the
> 'wrong' CPU.
>
> At this point '->state == TASK_RUNNING' and schedule() will pick it and
> ... we hit '->blocked_on == PROXY_WAKING', which leads to
> proxy_force_return(), which does deactivate_task()+activate_task() as
> per a normal migration, and then all is well.
>
> Right?
>
> You're asking why proxy_force_return() doesn't use block_task()+ttwu()?
> That seems really wrong at that point -- after all: '->state ==
> TASK_RUNNING'.
>
> Or; are you asking why we don't block_task() at the point where we set
> '->blocked_on = PROXY_WAKING'? And then let ttwu() sort things out?
>
> I suspect the latter is really hard to do vs lock ordering, but I've not
> thought it through.
So taking a step back, this is what we have today (at least the
common scenario):
CPU0 (donor - A) CPU1 (owner - B)
================ ================
mutex_lock()
__set_current_state(TASK_INTERRUPTIBLE)
__set_task_blocked_on(M)
schedule()
/* Retained for proxy */
proxy_migrate_task()
==================================> /* Migrates to CPU1 */
...
send_sig(B)
signal_wake_up_state()
wake_up_state()
try_to_wake_up()
ttwu_runnable()
ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */
/*
* After this point ttwu_state_match()
* will fail for A so a mutex_unlock()
* will have to go through __schedule()
* for return migration.
*/
__schedule()
find_proxy_task()
/* Scenario 1 - B sleeps */
__clear_task_blocked_on()
proxy_deactivate(A)
/* A->__state == TASK_RUNNING */
/* fallthrough */
/* Scenario 2 - return migration after unlock() */
__clear_task_blocked_on()
/*
* At this point proxy stops.
* Much later after signal.
*/
proxy_force_return()
schedule() <==================================
signal_pending_state()
clear_task_blocked_on()
__set_current_state(TASK_RUNNING)
... /* return with -EINTR */
Basically, a blocked donor has to wait for a mutex_unlock() before it
can go process the signal and bail out on the mutex_lock_interruptible()
which seems counterproductive - but it is still okay from correctness
perspective.
>
> One thing you *can* do is frob ttwu_runnable() to 'refuse' to wake the
> task, and then it goes into the normal path and will do the migration.
> I've done things like that before.
>
> Does that fix all the return-migration cases?
Yes it does! If we handle the return via ttwu_runnable(), which is what
proxy_needs_return() in the next chunk of changes aims to do, then we can
build the invariant that TASK_RUNNING + task_is_blocked() is an illegal
state outside of __schedule(), which works well with ttwu_state_match().
>
>> 2. Why does proxy_needs_return() (this comes later in John's tree but I
>> moved it up ahead) need the proxy_task_runnable_but_waking() override
>> of the ttwu_state_match() machinery?
>> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)
>
> Since it comes later, I've not seen it and not given it thought ;-)
>
> (I mean, I've probably seen it at some point, but being the gold-fish
> that I am, I have no recollection, so I might as well not have seen it).
>
> A brief look now makes me confused. The comment fails to describe how
> that situation could ever come to pass.
That is a signal delivery happening before the unlock, which forces
TASK_RUNNING; but since we are still waiting on the unlock, the wakeup
from the unlock will see TASK_RUNNING + PROXY_WAKING.
We then later force it on the ttwu path to do return via
ttwu_runnable().
>
>> 3. How can proxy_deactivate() see a TASK_RUNNING for blocked donor?
>
> I was looking at that.. I'm not sure. I mean, having the clause doesn't
> hurt, but yeah, dunno.
Outlined in that flow above - Scenario 1.
>
>
>> Speaking of that commit, I would like you or Juri to confirm if it is
>> okay to set a throttled deadline task as rq->donor for a while until it
>> hits resched.
>
> I think that should be okay.
Good to know! Are you planning to push out the changes to the queue? I can
send an RFC with the patches from my tree on top and we can perhaps
discuss it piecewise next week. Then we can decide if we want those
changes or not ;-)
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-27 16:57 ` K Prateek Nayak
@ 2026-04-02 15:50 ` Peter Zijlstra
2026-04-02 18:31 ` John Stultz
0 siblings, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-04-02 15:50 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Mar 27, 2026 at 10:27:08PM +0530, K Prateek Nayak wrote:
> So taking a step back, this is what we have today (at least the
> common scenario):
>
> CPU0 (donor - A) CPU1 (owner - B)
> ================ ================
>
> mutex_lock()
> __set_current_state(TASK_INTERRUPTIBLE)
> __set_task_blocked_on(M)
> schedule()
> /* Retained for proxy */
> proxy_migrate_task()
> ==================================> /* Migrates to CPU1 */
> ...
> send_sig(B)
> signal_wake_up_state()
> wake_up_state()
> try_to_wake_up()
> ttwu_runnable()
> ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */
>
> /*
> * After this point ttwu_state_match()
> * will fail for A so a mutex_unlock()
> * will have to go through __schedule()
> * for return migration.
> */
>
> __schedule()
> find_proxy_task()
>
> /* Scenario 1 - B sleeps */
> __clear_task_blocked_on()
> proxy_deactivate(A)
> /* A->__state == TASK_RUNNING */
> /* fallthrough */
>
> /* Scenario 2 - return migration after unlock() */
> __clear_task_blocked_on()
> /*
> * At this point proxy stops.
> * Much later after signal.
> */
> proxy_force_return()
> schedule() <==================================
> signal_pending_state()
>
> clear_task_blocked_on()
> __set_current_state(TASK_RUNNING)
>
> ... /* return with -EINTR */
>
>
> Basically, a blocked donor has to wait for a mutex_unlock() before it
> can go process the signal and bail out on the mutex_lock_interruptible()
> which seems counter productive - but it is still okay from correctness
> perspective.
>
> >
> > One thing you *can* do is frob ttwu_runnable() to 'refuse' to wake the
> > task, and then it goes into the normal path and will do the migration.
> > I've done things like that before.
> >
> > Does that fix all the return-migration cases?
>
> Yes it does! If we handle the return via ttwu_runnable(), which is what
> proxy_needs_return() in the next chunk of changes aims to do, then we can
> build the invariant that TASK_RUNNING + task_is_blocked() is an illegal
> state outside of __schedule(), which works well with ttwu_state_match().
>
> >
> >> 2. Why does proxy_needs_return() (this comes later in John's tree but I
> >> moved it up ahead) need the proxy_task_runnable_but_waking() override
> >> of the ttwu_state_match() machinery?
> >> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)
> >
> > Since it comes later, I've not seen it and not given it thought ;-)
> >
> > (I mean, I've probably seen it at some point, but being the gold-fish
> > that I am, I have no recollection, so I might as well not have seen it).
> >
> > A brief look now makes me confused. The comment fails to describe how
> > that situation could ever come to pass.
>
> That is a signal delivery happening before unlock which will force
> TASK_RUNNING but since we are waiting on an unlock, the wakeup from
> unlock will see TASK_RUNNING + PROXY_WAKING.
>
> We then later force it on the ttwu path to do return via
> ttwu_runnable().
So, I've not gone through all the cases yet, and it is *COMPLETELY*
untested, but something like the below perhaps?
---
include/linux/sched.h | 2
kernel/sched/core.c | 173 ++++++++++++++++----------------------------------
2 files changed, 58 insertions(+), 117 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -161,7 +161,7 @@ struct user_event_mm;
*/
#define is_special_task_state(state) \
((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
- TASK_DEAD | TASK_FROZEN))
+ TASK_DEAD | TASK_WAKING | TASK_FROZEN))
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define debug_normal_state_change(state_value) \
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2160,8 +2160,29 @@ void deactivate_task(struct rq *rq, stru
dequeue_task(rq, p, flags);
}
-static void block_task(struct rq *rq, struct task_struct *p, int flags)
+static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
{
+ int flags = DEQUEUE_NOCLOCK;
+
+ p->sched_contributes_to_load =
+ (task_state & TASK_UNINTERRUPTIBLE) &&
+ !(task_state & TASK_NOLOAD) &&
+ !(task_state & TASK_FROZEN);
+
+ if (unlikely(is_special_task_state(task_state)))
+ flags |= DEQUEUE_SPECIAL;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (p->on_rq && ...)
+ * if (prev_state) goto out;
+ * p->on_rq = 0; smp_acquire__after_ctrl_dep();
+ * p->state = TASK_WAKING
+ *
+ * Where __schedule() and ttwu() have matching control dependencies.
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
__block_task(rq, p);
}
@@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
*/
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
- struct rq_flags rf;
- struct rq *rq;
- int ret = 0;
+ ACQUIRE(__task_rq_lock, guard)(p);
+ struct rq *rq = guard.rq;
- rq = __task_rq_lock(p, &rf);
- if (task_on_rq_queued(p)) {
- update_rq_clock(rq);
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
- if (!task_on_cpu(rq, p)) {
+ if (!task_on_rq_queued(p))
+ return 0;
+
+ if (sched_proxy_exec() && p->blocked_on) {
+ guard(raw_spinlock)(&p->blocked_lock);
+ struct mutex *lock = p->blocked_on;
+ if (lock) {
/*
- * When on_rq && !on_cpu the task is preempted, see if
- * it should preempt the task that is current now.
+ * TASK_WAKING is a special state and results in
+ * DEQUEUE_SPECIAL such that the task will actually be
+ * forced from the runqueue.
*/
- wakeup_preempt(rq, p, wake_flags);
+ block_task(rq, p, TASK_WAKING);
+ p->blocked_on = NULL;
+ return 0;
}
- ttwu_do_wakeup(p);
- ret = 1;
}
- __task_rq_unlock(rq, p, &rf);
- return ret;
+ update_rq_clock(rq);
+ if (p->se.sched_delayed)
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+ if (!task_on_cpu(rq, p)) {
+ /*
+ * When on_rq && !on_cpu the task is preempted, see if
+ * it should preempt the task that is current now.
+ */
+ wakeup_preempt(rq, p, wake_flags);
+ }
+ ttwu_do_wakeup(p);
+ return 1;
}
void sched_ttwu_pending(void *arg)
@@ -6519,7 +6551,6 @@ static bool try_to_block_task(struct rq
unsigned long *task_state_p, bool should_block)
{
unsigned long task_state = *task_state_p;
- int flags = DEQUEUE_NOCLOCK;
if (signal_pending_state(task_state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
@@ -6539,26 +6570,7 @@ static bool try_to_block_task(struct rq
if (!should_block)
return false;
- p->sched_contributes_to_load =
- (task_state & TASK_UNINTERRUPTIBLE) &&
- !(task_state & TASK_NOLOAD) &&
- !(task_state & TASK_FROZEN);
-
- if (unlikely(is_special_task_state(task_state)))
- flags |= DEQUEUE_SPECIAL;
-
- /*
- * __schedule() ttwu()
- * prev_state = prev->state; if (p->on_rq && ...)
- * if (prev_state) goto out;
- * p->on_rq = 0; smp_acquire__after_ctrl_dep();
- * p->state = TASK_WAKING
- *
- * Where __schedule() and ttwu() have matching control dependencies.
- *
- * After this, schedule() must not care about p->state any more.
- */
- block_task(rq, p, flags);
+ block_task(rq, p, task_state);
return true;
}
@@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
return rq->idle;
}
-static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
+static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
{
unsigned long state = READ_ONCE(donor->__state);
- /* Don't deactivate if the state has been changed to TASK_RUNNING */
- if (state == TASK_RUNNING)
- return false;
+ WARN_ON_ONCE(state == TASK_RUNNING);
+
/*
* Because we got donor from pick_next_task(), it is *crucial*
* that we call proxy_resched_idle() before we deactivate it.
@@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
* need to be changed from next *before* we deactivate.
*/
proxy_resched_idle(rq);
- return try_to_block_task(rq, donor, &state, true);
+ block_task(rq, donor, state);
}
static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
@@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
proxy_reacquire_rq_lock(rq, rf);
}
-static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
- struct task_struct *p)
- __must_hold(__rq_lockp(rq))
-{
- struct rq *task_rq, *target_rq = NULL;
- int cpu, wake_flag = WF_TTWU;
-
- lockdep_assert_rq_held(rq);
- WARN_ON(p == rq->curr);
-
- if (p == rq->donor)
- proxy_resched_idle(rq);
-
- proxy_release_rq_lock(rq, rf);
- /*
- * We drop the rq lock, and re-grab task_rq_lock to get
- * the pi_lock (needed for select_task_rq) as well.
- */
- scoped_guard (task_rq_lock, p) {
- task_rq = scope.rq;
-
- /*
- * Since we let go of the rq lock, the task may have been
- * woken or migrated to another rq before we got the
- * task_rq_lock. So re-check we're on the same RQ. If
- * not, the task has already been migrated and that CPU
- * will handle any futher migrations.
- */
- if (task_rq != rq)
- break;
-
- /*
- * Similarly, if we've been dequeued, someone else will
- * wake us
- */
- if (!task_on_rq_queued(p))
- break;
-
- /*
- * Since we should only be calling here from __schedule()
- * -> find_proxy_task(), no one else should have
- * assigned current out from under us. But check and warn
- * if we see this, then bail.
- */
- if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
- WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
- __func__, cpu_of(task_rq),
- p->comm, p->pid, p->on_cpu);
- break;
- }
-
- update_rq_clock(task_rq);
- deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
- cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
- set_task_cpu(p, cpu);
- target_rq = cpu_rq(cpu);
- clear_task_blocked_on(p, NULL);
- }
-
- if (target_rq)
- attach_one_task(target_rq, p);
-
- proxy_reacquire_rq_lock(rq, rf);
-}
-
/*
* Find runnable lock owner to proxy for mutex blocked donor
*
@@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
clear_task_blocked_on(p, PROXY_WAKING);
return p;
}
- goto force_return;
+ goto deactivate;
}
/*
@@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
__clear_task_blocked_on(p, NULL);
return p;
}
- goto force_return;
+ goto deactivate;
}
if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
@@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
return owner;
deactivate:
- if (proxy_deactivate(rq, donor))
- return NULL;
- /* If deactivate fails, force return */
- p = donor;
-force_return:
- proxy_force_return(rq, rf, p);
+ proxy_deactivate(rq, donor);
return NULL;
migrate_task:
proxy_migrate_task(rq, rf, p, owner_cpu);
^ permalink raw reply [flat|nested] 56+ messages in thread* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-02 15:50 ` Peter Zijlstra
@ 2026-04-02 18:31 ` John Stultz
2026-04-02 21:04 ` John Stultz
` (2 more replies)
0 siblings, 3 replies; 56+ messages in thread
From: John Stultz @ 2026-04-02 18:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: K Prateek Nayak, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Thu, Apr 2, 2026 at 8:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Mar 27, 2026 at 10:27:08PM +0530, K Prateek Nayak wrote:
>
> > So taking a step back, this is what we have today (at least the
> > common scenario):
> >
> > CPU0 (donor - A) CPU1 (owner - B)
> > ================ ================
> >
> > mutex_lock()
> > __set_current_state(TASK_INTERRUPTIBLE)
> > __set_task_blocked_on(M)
> > schedule()
> > /* Retained for proxy */
> > proxy_migrate_task()
> > ==================================> /* Migrates to CPU1 */
> > ...
> > send_sig(B)
> > signal_wake_up_state()
> > wake_up_state()
> > try_to_wake_up()
> > ttwu_runnable()
> > ttwu_do_wakeup() =============> /* A->__state = TASK_RUNNING */
> >
> > /*
> > * After this point ttwu_state_match()
> > * will fail for A so a mutex_unlock()
> > * will have to go through __schedule()
> > * for return migration.
> > */
> >
> > __schedule()
> > find_proxy_task()
> >
> > /* Scenario 1 - B sleeps */
> > __clear_task_blocked_on()
> > proxy_deactivate(A)
> > /* A->__state == TASK_RUNNING */
> > /* fallthrough */
> >
> > /* Scenario 2 - return migration after unlock() */
> > __clear_task_blocked_on()
> > /*
> > * At this point proxy stops.
> > * Much later after signal.
> > */
> > proxy_force_return()
> > schedule() <==================================
> > signal_pending_state()
> >
> > clear_task_blocked_on()
> > __set_current_state(TASK_RUNNING)
> >
> > ... /* return with -EINTR */
> >
> >
> > Basically, a blocked donor has to wait for a mutex_unlock() before it
> > can go process the signal and bail out on the mutex_lock_interruptible()
> > which seems counter productive - but it is still okay from correctness
> > perspective.
> >
> > >
> > > One thing you *can* do it frob ttwu_runnable() to 'refuse' to wake the
> > > task, and then it goes into the normal path and will do the migration.
> > > I've done things like that before.
> > >
> > > Does that fix all the return-migration cases?
> >
> > Yes it does! If we handle the return via ttwu_runnable(), which is what
> > proxy_needs_return() in the next chunk of changes aims to do and we can
> > build the invariant that TASK_RUNNING + task_is_blocked() is an illegal
> > state outside of __schedule() which works well with ttwu_state_match().
> >
> > >
> > >> 2. Why does proxy_needs_return() (this comes later in John's tree but I
> > >> moved it up ahead) need the proxy_task_runnable_but_waking() override
> > >> of the ttwu_state_match() machinery?
> > >> (https://github.com/johnstultz-work/linux-dev/commit/28ad4d3fa847b90713ca18a623d1ee7f73b648d9)
> > >
> > > Since it comes later, I've not seen it and not given it thought ;-)
> > >
> > > (I mean, I've probably seen it at some point, but being the gold-fish
> > > that I am, I have no recollection, so I might as well not have seen it).
> > >
> > > A brief look now makes me confused. The comment fails to describe how
> > > that situation could ever come to pass.
> >
> > That is a signal delivery happening before unlock which will force
> > TASK_RUNNING but since we are waiting on an unlock, the wakeup from
> > unlock will see TASK_RUNNING + PROXY_WAKING.
> >
> > We then later force it on the ttwu path to do return via
> > ttwu_runnable().
>
> So, I've not gone through all the cases yet, and it is *COMPLETELY*
> untested, but something like the below perhaps?
>
So, just to clarify, this suggestion is an alternative to my
return-migration via ttwu logic (not included in the v26
simple-donor-migration chunk you've queued)?
https://github.com/johnstultz-work/linux-dev/commit/dfaa472f2a0b2f6a0c73083aaba5c55e256fdb56
I'm working on prepping that next chunk of the series (which includes
that logic) to send out here shortly (integrating the few bits from
Prateek that I've managed to get my head around).
There's still a bunch of other changes tied into waking a rq->donor
outside of __schedule() in that chunk, so I'm not sure if this
discussion would be easier to have in context once those are on the
list?
So some of my (certainly confused) thoughts below...
> ---
> include/linux/sched.h | 2
> kernel/sched/core.c | 173 ++++++++++++++++----------------------------------
> 2 files changed, 58 insertions(+), 117 deletions(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -161,7 +161,7 @@ struct user_event_mm;
> */
> #define is_special_task_state(state) \
> ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
> - TASK_DEAD | TASK_FROZEN))
> + TASK_DEAD | TASK_WAKING | TASK_FROZEN))
>
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> # define debug_normal_state_change(state_value) \
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
...
> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
> */
> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> {
> - struct rq_flags rf;
> - struct rq *rq;
> - int ret = 0;
> + ACQUIRE(__task_rq_lock, guard)(p);
> + struct rq *rq = guard.rq;
>
> - rq = __task_rq_lock(p, &rf);
> - if (task_on_rq_queued(p)) {
> - update_rq_clock(rq);
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> - if (!task_on_cpu(rq, p)) {
> + if (!task_on_rq_queued(p))
> + return 0;
> +
> + if (sched_proxy_exec() && p->blocked_on) {
> + guard(raw_spinlock)(&p->blocked_lock);
> + struct mutex *lock = p->blocked_on;
> + if (lock) {
> /*
> - * When on_rq && !on_cpu the task is preempted, see if
> - * it should preempt the task that is current now.
> + * TASK_WAKING is a special state and results in
> + * DEQUEUE_SPECIAL such that the task will actually be
> + * forced from the runqueue.
> */
> - wakeup_preempt(rq, p, wake_flags);
> + block_task(rq, p, TASK_WAKING);
> + p->blocked_on = NULL;
> + return 0;
> }
> - ttwu_do_wakeup(p);
> - ret = 1;
> }
> - __task_rq_unlock(rq, p, &rf);
>
> - return ret;
> + update_rq_clock(rq);
> + if (p->se.sched_delayed)
> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
I can't precisely remember the details now, but I believe we need to
handle enqueueing sched_delayed tasks before handling blocked_on
tasks.
> + if (!task_on_cpu(rq, p)) {
> + /*
> + * When on_rq && !on_cpu the task is preempted, see if
> + * it should preempt the task that is current now.
> + */
> + wakeup_preempt(rq, p, wake_flags);
> + }
> + ttwu_do_wakeup(p);
> + return 1;
> }
>
> void sched_ttwu_pending(void *arg)
...
> @@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
> return rq->idle;
> }
>
> -static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
> +static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
> {
> unsigned long state = READ_ONCE(donor->__state);
>
> - /* Don't deactivate if the state has been changed to TASK_RUNNING */
> - if (state == TASK_RUNNING)
> - return false;
> + WARN_ON_ONCE(state == TASK_RUNNING);
> +
> /*
> * Because we got donor from pick_next_task(), it is *crucial*
> * that we call proxy_resched_idle() before we deactivate it.
> @@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
> * need to be changed from next *before* we deactivate.
> */
> proxy_resched_idle(rq);
> - return try_to_block_task(rq, donor, &state, true);
> + block_task(rq, donor, state);
> }
>
> static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
> @@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
> proxy_reacquire_rq_lock(rq, rf);
> }
>
> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> - struct task_struct *p)
> - __must_hold(__rq_lockp(rq))
> -{
> - struct rq *task_rq, *target_rq = NULL;
> - int cpu, wake_flag = WF_TTWU;
> -
> - lockdep_assert_rq_held(rq);
> - WARN_ON(p == rq->curr);
> -
> - if (p == rq->donor)
> - proxy_resched_idle(rq);
> -
> - proxy_release_rq_lock(rq, rf);
> - /*
> - * We drop the rq lock, and re-grab task_rq_lock to get
> - * the pi_lock (needed for select_task_rq) as well.
> - */
> - scoped_guard (task_rq_lock, p) {
> - task_rq = scope.rq;
> -
> - /*
> - * Since we let go of the rq lock, the task may have been
> - * woken or migrated to another rq before we got the
> - * task_rq_lock. So re-check we're on the same RQ. If
> - * not, the task has already been migrated and that CPU
> - * will handle any futher migrations.
> - */
> - if (task_rq != rq)
> - break;
> -
> - /*
> - * Similarly, if we've been dequeued, someone else will
> - * wake us
> - */
> - if (!task_on_rq_queued(p))
> - break;
> -
> - /*
> - * Since we should only be calling here from __schedule()
> - * -> find_proxy_task(), no one else should have
> - * assigned current out from under us. But check and warn
> - * if we see this, then bail.
> - */
> - if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
> - WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
> - __func__, cpu_of(task_rq),
> - p->comm, p->pid, p->on_cpu);
> - break;
> - }
> -
> - update_rq_clock(task_rq);
> - deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
> - cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
> - set_task_cpu(p, cpu);
> - target_rq = cpu_rq(cpu);
> - clear_task_blocked_on(p, NULL);
> - }
> -
> - if (target_rq)
> - attach_one_task(target_rq, p);
> -
> - proxy_reacquire_rq_lock(rq, rf);
> -}
> -
> /*
> * Find runnable lock owner to proxy for mutex blocked donor
> *
> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
> clear_task_blocked_on(p, PROXY_WAKING);
> return p;
> }
> - goto force_return;
> + goto deactivate;
> }
>
> /*
> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
> __clear_task_blocked_on(p, NULL);
> return p;
> }
> - goto force_return;
> + goto deactivate;
> }
>
> if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
> @@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
> return owner;
>
> deactivate:
> - if (proxy_deactivate(rq, donor))
> - return NULL;
> - /* If deactivate fails, force return */
> - p = donor;
> -force_return:
> - proxy_force_return(rq, rf, p);
> + proxy_deactivate(rq, donor);
> return NULL;
> migrate_task:
> proxy_migrate_task(rq, rf, p, owner_cpu);
So I like getting rid of proxy_force_return(), but it's not clear to me
that proxy_deactivate() is what we want to do in these
find_proxy_task() edge cases.
It feels like if we are already racing with ttwu, deactivating the
task seems like it might open more windows where we might lose the
wakeup.
In fact, the whole reason we have proxy_force_return() is that earlier
in the proxy-exec development, when we hit those edge cases we usually
would just call proxy_resched_idle() and return, to drop the rq lock and let
ttwu do its thing, but there kept on being cases where we would end up
with lost wakeups.
But I'll give this a shot (and will integrate your ttwu_runnable
cleanups regardless) and see how it does.
thanks
-john
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-02 18:31 ` John Stultz
@ 2026-04-02 21:04 ` John Stultz
2026-04-03 6:09 ` K Prateek Nayak
2026-04-03 9:18 ` Peter Zijlstra
2 siblings, 0 replies; 56+ messages in thread
From: John Stultz @ 2026-04-02 21:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: K Prateek Nayak, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Thu, Apr 2, 2026 at 11:31 AM John Stultz <jstultz@google.com> wrote:
> On Thu, Apr 2, 2026 at 8:51 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So, I've not gone through all the cases yet, and it is *COMPLETELY*
> > untested, but something like the below perhaps?
> >
> But I'll give this a shot (and will integrate your ttwu_runnable
> cleanups regardless) and see how it does.
So I tweaked it to move the blocked_on handling after the
p->se.sched_delayed check in ttwu_runnable(), otherwise it crashed
very quickly.
But even then, I unfortunately quickly see the WARN_ON_ONCE(state ==
TASK_RUNNING); in proxy_deactivate() trip and then it hangs after
starting the mutex_lock-torture tests. :(
I'll try to get my next set cleaned up and shared here shortly and
maybe we can try to whittle it down to something closer to your
approach.
thanks
-john
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-02 18:31 ` John Stultz
2026-04-02 21:04 ` John Stultz
@ 2026-04-03 6:09 ` K Prateek Nayak
2026-04-03 9:52 ` Peter Zijlstra
2026-04-03 9:18 ` Peter Zijlstra
2 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-04-03 6:09 UTC (permalink / raw)
To: John Stultz, Peter Zijlstra
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Juri Lelli,
Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello Peter, John,
On 4/3/2026 12:01 AM, John Stultz wrote:
>> So, I've not gone through all the cases yet, and it is *COMPLETELY*
>> untested, but something like the below perhaps?
>>
>
> So, just to clarify, this suggestion is an alternative to my
> return-migration via ttwu logic (not included in the v26
> simple-donor-migration chunk you've queued)?
> https://github.com/johnstultz-work/linux-dev/commit/dfaa472f2a0b2f6a0c73083aaba5c55e256fdb56
>
> I'm working on prepping that next chunk of the series (which includes
> that logic) to send out here shortly (integrating the few bits from
> Prateek that I've managed to get my head around).
Thanks and I agree that it might be more digestible that way.
>
> There's still a bunch of other changes tied into waking a rq->donor
> outside of __schedule() in that chunk, so I'm not sure if this
> discussion would be easier to have in context once those are on the
> list?
>
> So some of my (certainly confused) thoughts below...
>
>> ---
>> include/linux/sched.h | 2
>> kernel/sched/core.c | 173 ++++++++++++++++----------------------------------
>> 2 files changed, 58 insertions(+), 117 deletions(-)
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -161,7 +161,7 @@ struct user_event_mm;
>> */
>> #define is_special_task_state(state) \
>> ((state) & (__TASK_STOPPED | __TASK_TRACED | TASK_PARKED | \
>> - TASK_DEAD | TASK_FROZEN))
>> + TASK_DEAD | TASK_WAKING | TASK_FROZEN))
>>
>> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>> # define debug_normal_state_change(state_value) \
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
> ...
>> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
>> */
>> static int ttwu_runnable(struct task_struct *p, int wake_flags)
>> {
>> - struct rq_flags rf;
>> - struct rq *rq;
>> - int ret = 0;
>> + ACQUIRE(__task_rq_lock, guard)(p);
>> + struct rq *rq = guard.rq;
>>
>> - rq = __task_rq_lock(p, &rf);
>> - if (task_on_rq_queued(p)) {
>> - update_rq_clock(rq);
>> - if (p->se.sched_delayed)
>> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>> - if (!task_on_cpu(rq, p)) {
>> + if (!task_on_rq_queued(p))
>> + return 0;
>> +
>> + if (sched_proxy_exec() && p->blocked_on) {
>> + guard(raw_spinlock)(&p->blocked_lock);
>> + struct mutex *lock = p->blocked_on;
>> + if (lock) {
>> /*
>> - * When on_rq && !on_cpu the task is preempted, see if
>> - * it should preempt the task that is current now.
>> + * TASK_WAKING is a special state and results in
>> + * DEQUEUE_SPECIAL such that the task will actually be
>> + * forced from the runqueue.
>> */
>> - wakeup_preempt(rq, p, wake_flags);
>> + block_task(rq, p, TASK_WAKING);
This needs to reset the rq->donor if the task getting woken up is the
current donor.
>> + p->blocked_on = NULL;
>> + return 0;
>> }
>> - ttwu_do_wakeup(p);
>> - ret = 1;
>> }
>> - __task_rq_unlock(rq, p, &rf);
>>
>> - return ret;
>> + update_rq_clock(rq);
nit. Since block_task() adds a DEQUEUE_NOCLOCK now we need to move that
clock update before the block.
>> + if (p->se.sched_delayed)
>> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>
> I can't precisely remember the details now, but I believe we need to
> handle enqueueing sched_delayed tasks before handling blocked_on
> tasks.
So proxy_deactivate() can still delay the task leading to
task_on_rq_queued() and the wakeup coming to ttwu_runnable() so either
we can dequeue it fully in proxy_deactivate() or we need to teach
block_task() to add a DEQUEUE_DELAYED flag when task_is_blocked().
I think the former is cleaner but we don't decay lag for fair tasks :-(
We can't simply re-enqueue it either since proxy migration might have
put it on a CPU outside its affinity mask so we need to take a full
dequeue + wakeup in ttwu_runnable().
>
>
>> + if (!task_on_cpu(rq, p)) {
>> + /*
>> + * When on_rq && !on_cpu the task is preempted, see if
>> + * it should preempt the task that is current now.
>> + */
>> + wakeup_preempt(rq, p, wake_flags);
>> + }
>> + ttwu_do_wakeup(p);
>> + return 1;
>> }
>>
>> void sched_ttwu_pending(void *arg)
> ...
>> @@ -6586,13 +6598,12 @@ static inline struct task_struct *proxy_
>> return rq->idle;
>> }
>>
>> -static bool proxy_deactivate(struct rq *rq, struct task_struct *donor)
>> +static void proxy_deactivate(struct rq *rq, struct task_struct *donor)
>> {
>> unsigned long state = READ_ONCE(donor->__state);
>>
>> - /* Don't deactivate if the state has been changed to TASK_RUNNING */
>> - if (state == TASK_RUNNING)
>> - return false;
>> + WARN_ON_ONCE(state == TASK_RUNNING);
>> +
>> /*
>> * Because we got donor from pick_next_task(), it is *crucial*
>> * that we call proxy_resched_idle() before we deactivate it.
>> @@ -6603,7 +6614,7 @@ static bool proxy_deactivate(struct rq *
>> * need to be changed from next *before* we deactivate.
>> */
>> proxy_resched_idle(rq);
>> - return try_to_block_task(rq, donor, &state, true);
>> + block_task(rq, donor, state);
>> }
>>
>> static inline void proxy_release_rq_lock(struct rq *rq, struct rq_flags *rf)
>> @@ -6677,71 +6688,6 @@ static void proxy_migrate_task(struct rq
>> proxy_reacquire_rq_lock(rq, rf);
>> }
>>
>> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
>> - struct task_struct *p)
>> - __must_hold(__rq_lockp(rq))
>> -{
>> - struct rq *task_rq, *target_rq = NULL;
>> - int cpu, wake_flag = WF_TTWU;
>> -
>> - lockdep_assert_rq_held(rq);
>> - WARN_ON(p == rq->curr);
>> -
>> - if (p == rq->donor)
>> - proxy_resched_idle(rq);
>> -
>> - proxy_release_rq_lock(rq, rf);
>> - /*
>> - * We drop the rq lock, and re-grab task_rq_lock to get
>> - * the pi_lock (needed for select_task_rq) as well.
>> - */
>> - scoped_guard (task_rq_lock, p) {
>> - task_rq = scope.rq;
>> -
>> - /*
>> - * Since we let go of the rq lock, the task may have been
>> - * woken or migrated to another rq before we got the
>> - * task_rq_lock. So re-check we're on the same RQ. If
>> - * not, the task has already been migrated and that CPU
>> - * will handle any futher migrations.
>> - */
>> - if (task_rq != rq)
>> - break;
>> -
>> - /*
>> - * Similarly, if we've been dequeued, someone else will
>> - * wake us
>> - */
>> - if (!task_on_rq_queued(p))
>> - break;
>> -
>> - /*
>> - * Since we should only be calling here from __schedule()
>> - * -> find_proxy_task(), no one else should have
>> - * assigned current out from under us. But check and warn
>> - * if we see this, then bail.
>> - */
>> - if (task_current(task_rq, p) || task_on_cpu(task_rq, p)) {
>> - WARN_ONCE(1, "%s rq: %i current/on_cpu task %s %d on_cpu: %i\n",
>> - __func__, cpu_of(task_rq),
>> - p->comm, p->pid, p->on_cpu);
>> - break;
>> - }
>> -
>> - update_rq_clock(task_rq);
>> - deactivate_task(task_rq, p, DEQUEUE_NOCLOCK);
>> - cpu = select_task_rq(p, p->wake_cpu, &wake_flag);
>> - set_task_cpu(p, cpu);
>> - target_rq = cpu_rq(cpu);
>> - clear_task_blocked_on(p, NULL);
>> - }
>> -
>> - if (target_rq)
>> - attach_one_task(target_rq, p);
>> -
>> - proxy_reacquire_rq_lock(rq, rf);
>> -}
>> -
Went a little heavy on the delete there, did you? :-)
>> /*
>> * Find runnable lock owner to proxy for mutex blocked donor
>> *
>> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
>> clear_task_blocked_on(p, PROXY_WAKING);
>> return p;
>> }
>> - goto force_return;
>> + goto deactivate;
>> }
This makes sense if we preserve the !TASK_RUNNING + p->blocked_on
invariant since we'll definitely get a wakeup here.
>>
>> /*
>> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
>> __clear_task_blocked_on(p, NULL);
>> return p;
>> }
>> - goto force_return;
>> + goto deactivate;
This too makes sense considering !owner implies some task will be woken
up but ... if we take this task off and another task steals the mutex,
this task will no longer be able to proxy it since it is completely
blocked now.
Probably not desired. We should at least let it run and see if it can
get the mutex and evaluate the "p->blocked_on" again since !owner is
a limbo state.
>> }
>>
>> if (!READ_ONCE(owner->on_rq) || owner->se.sched_delayed) {
>> @@ -6891,12 +6837,7 @@ find_proxy_task(struct rq *rq, struct ta
>> return owner;
>>
>> deactivate:
>> - if (proxy_deactivate(rq, donor))
>> - return NULL;
>> - /* If deactivate fails, force return */
>> - p = donor;
>> -force_return:
>> - proxy_force_return(rq, rf, p);
>> + proxy_deactivate(rq, donor);
>> return NULL;
>> migrate_task:
>> proxy_migrate_task(rq, rf, p, owner_cpu);
>
> So I like getting rid of proxy_force_return(), but it's not clear to me
> that proxy_deactivate() is what we want to do in these
> find_proxy_task() edge cases.
>
> It feels like if we are already racing with ttwu, deactivating the
> task seems like it might open more windows where we might lose the
> wakeup.
>
> In fact, the whole reason we have proxy_force_return() is that earlier
> in the proxy-exec development, when we hit those edge cases we usually
> would just call proxy_resched_idle() and return, to drop the rq lock and let
> ttwu do its thing, but there kept on being cases where we would end up
> with lost wakeups.
>
> But I'll give this a shot (and will integrate your ttwu_runnable
> cleanups regardless) and see how it does.
So I added the following on top of Peter's diff on top of
queue:sched/core and it hasn't crashed and burnt yet when running a
handful of instances of sched-messaging with a mix of fair and SCHED_RR
priority:
(Includes John's findings from the parallel thread)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b2b2451720a..e845e3a8ae65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
dequeue_task(rq, p, flags);
}
-static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
+static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
{
int flags = DEQUEUE_NOCLOCK;
@@ -3696,6 +3696,20 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
}
}
+static void zap_balance_callbacks(struct rq *rq);
+
+static inline void proxy_reset_donor(struct rq *rq)
+{
+#ifdef CONFIG_SCHED_PROXY_EXEC
+ WARN_ON_ONCE(rq->curr == rq->donor);
+
+ put_prev_set_next_task(rq, rq->donor, rq->curr);
+ rq_set_donor(rq, rq->curr);
+ zap_balance_callbacks(rq);
+ resched_curr(rq);
+#endif
+}
+
/*
* Consider @p being inside a wait loop:
*
@@ -3730,6 +3744,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
return 0;
update_rq_clock(rq);
+ if (p->se.sched_delayed)
+ enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
if (sched_proxy_exec() && p->blocked_on) {
guard(raw_spinlock)(&p->blocked_lock);
struct mutex *lock = p->blocked_on;
@@ -3738,15 +3754,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
* TASK_WAKING is a special state and results in
* DEQUEUE_SPECIAL such that the task will actually be
* forced from the runqueue.
+ *
+ * XXX: All of this is now equivalent of
+ * proxy_needs_return() from John's series :-)
*/
- block_task(rq, p, TASK_WAKING);
p->blocked_on = NULL;
+ if (task_current(rq, p))
+ goto out;
+ if (task_current_donor(rq, p))
+ proxy_reset_donor(rq);
+ block_task(rq, p, TASK_WAKING);
return 0;
}
}
-
- if (p->se.sched_delayed)
- enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
+out:
if (!task_on_cpu(rq, p)) {
/*
* When on_rq && !on_cpu the task is preempted, see if
@@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
+ /*
+ * We never clear the blocked_on relation on proxy_deactivate.
+ * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
+ * when waking up. Since this is a fully blocked, off CPU task
+ * waking up, it should be safe to clear the blocked_on relation.
+ */
+ if (task_is_blocked(p))
+ clear_task_blocked_on(p, NULL);
+
cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
@@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
switch_count = &prev->nvcsw;
}
+ /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
+ if (!prev_state && task_is_blocked(prev))
+ clear_task_blocked_on(prev, NULL);
+
pick_again:
assert_balance_callbacks_empty(rq);
next = pick_next_task(rq, rq->donor, &rf);
---
Now, there are obviously some sharp edges that I have highlighted above
which may affect performance and correctness to some extent but once we
have sleeping owner bits, it should all go away.
Anyways, I'll let you bash me now on why that try_to_wake_up() hunk
might be totally wrong and dangerous :-)
--
Thanks and Regards,
Prateek
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-03 6:09 ` K Prateek Nayak
@ 2026-04-03 9:52 ` Peter Zijlstra
2026-04-03 10:25 ` K Prateek Nayak
0 siblings, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-04-03 9:52 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Apr 03, 2026 at 11:39:25AM +0530, K Prateek Nayak wrote:
> >> @@ -3702,28 +3723,39 @@ ttwu_do_activate(struct rq *rq, struct t
> >> */
> >> static int ttwu_runnable(struct task_struct *p, int wake_flags)
> >> {
> >> - struct rq_flags rf;
> >> - struct rq *rq;
> >> - int ret = 0;
> >> + ACQUIRE(__task_rq_lock, guard)(p);
> >> + struct rq *rq = guard.rq;
> >>
> >> - rq = __task_rq_lock(p, &rf);
> >> - if (task_on_rq_queued(p)) {
> >> - update_rq_clock(rq);
> >> - if (p->se.sched_delayed)
> >> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> >> - if (!task_on_cpu(rq, p)) {
> >> + if (!task_on_rq_queued(p))
> >> + return 0;
> >> +
> >> + if (sched_proxy_exec() && p->blocked_on) {
> >> + guard(raw_spinlock)(&p->blocked_lock);
> >> + struct mutex *lock = p->blocked_on;
> >> + if (lock) {
> >> /*
> >> - * When on_rq && !on_cpu the task is preempted, see if
> >> - * it should preempt the task that is current now.
> >> + * TASK_WAKING is a special state and results in
> >> + * DEQUEUE_SPECIAL such that the task will actually be
> >> + * forced from the runqueue.
> >> */
> >> - wakeup_preempt(rq, p, wake_flags);
> >> + block_task(rq, p, TASK_WAKING);
>
> This needs to reset the rq->donor if the task getting woken up is the
> current donor.
*groan*, that is a fun case. I'll ponder that.
> >> + p->blocked_on = NULL;
> >> + return 0;
> >> }
> >> - ttwu_do_wakeup(p);
> >> - ret = 1;
> >> }
> >> - __task_rq_unlock(rq, p, &rf);
> >>
> >> - return ret;
> >> + update_rq_clock(rq);
>
> nit. Since block_task() adds a DEQUEUE_NOCLOCK now we need to move that
> clock update before the block.
D'0h :-)
> >> + if (p->se.sched_delayed)
> >> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> >
> > I can't precisely remember the details now, but I believe we need to
> > handle enqueueing sched_delayed tasks before handling blocked_on
> > tasks.
>
> So proxy_deactivate() can still delay the task leading to
> task_on_rq_queued() and the wakeup coming to ttwu_runnable() so either
> we can dequeue it fully in proxy_deactivate() or we need to teach
> block_task() to add a DEQUEUE_DELAYED flag when task_is_blocked().
>
> I think the former is cleaner but we don't decay lag for fair tasks :-(
>
> We can't simply re-enqueue it either since proxy migration might have
> put it on a CPU outside its affinity mask so we need to take a full
> dequeue + wakeup in ttwu_runnable().
Right, sanest option is to have ttwu_runnable() deal with this.
> >> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
> >> - struct task_struct *p)
> >> - __must_hold(__rq_lockp(rq))
> >> -{
> >> -}
> >> -
>
> Went a little heavy on the delete there, did you? :-)
Well, I thought that was the whole idea, have ttwu() handle this :-)
> >> /*
> >> * Find runnable lock owner to proxy for mutex blocked donor
> >> *
> >> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
> >> clear_task_blocked_on(p, PROXY_WAKING);
> >> return p;
> >> }
> >> - goto force_return;
> >> + goto deactivate;
> >> }
>
> This makes sense if we preserve the !TASK_RUNNING + p->blocked_on
> invariant since we'll definitely get a wakeup here.
Right, so TASK_RUNNING must imply !->blocked_on.
> >>
> >> /*
> >> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
> >> __clear_task_blocked_on(p, NULL);
> >> return p;
> >> }
> >> - goto force_return;
> >> + goto deactivate;
>
> This too makes sense considering !owner implies some task will be woken
> up but ... if we take this task off and another task steals the mutex,
> this task will no longer be able to proxy it since it is completely
> blocked now.
>
> Probably not desired. We should at least let it run and see if it can
> get the mutex and evaluate the "p->blocked_on" again since !owner is
> a limbo state.
I need to go re-read the mutex side of things, but doesn't that do
hand-off way more aggressively?
Anyway, one thing that is completely missing is a fast path for when the
task is still inside its valid mask. I suspect adding that back will
cure some of these issues.
> So I added the following on top of Peter's diff on top of
> queue:sched/core and it hasn't crashed and burnt yet when running a
> handful of instances of sched-messaging with a mix of fair and SCHED_RR
> priority:
>
> (Includes John's findings from the parallel thread)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b2b2451720a..e845e3a8ae65 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
> dequeue_task(rq, p, flags);
> }
>
> -static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
> +static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
> {
> int flags = DEQUEUE_NOCLOCK;
>
> @@ -3696,6 +3696,20 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> }
> }
>
> +static void zap_balance_callbacks(struct rq *rq);
> +
> +static inline void proxy_reset_donor(struct rq *rq)
> +{
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> + WARN_ON_ONCE(rq->curr == rq->donor);
> +
> + put_prev_set_next_task(rq, rq->donor, rq->curr);
> + rq_set_donor(rq, rq->curr);
> + zap_balance_callbacks(rq);
> + resched_curr(rq);
> +#endif
> +}
This one hurts my brain :-)
> /*
> * Consider @p being inside a wait loop:
> *
> @@ -3730,6 +3744,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
> return 0;
>
> update_rq_clock(rq);
> + if (p->se.sched_delayed)
> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
Right, this works but seems wasteful, might be better to add
DEQUEUE_DELAYED in the blocked_on case.
> if (sched_proxy_exec() && p->blocked_on) {
So I had doubts about this lockless test of ->blocked_on, I still cannot
convince myself it is correct.
> guard(raw_spinlock)(&p->blocked_lock);
> struct mutex *lock = p->blocked_on;
> @@ -3738,15 +3754,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
> * TASK_WAKING is a special state and results in
> * DEQUEUE_SPECIAL such that the task will actually be
> * forced from the runqueue.
> + *
> + * XXX: All of this is now equivalent of
> + * proxy_needs_return() from John's series :-)
> */
> - block_task(rq, p, TASK_WAKING);
> p->blocked_on = NULL;
> + if (task_current(rq, p))
> + goto out;
Right, fair enough :-) This could also be done when rq->cpu is inside
p->cpus_ptr mask, because in that case we don't strictly need a
migration. Thinking about that was on the todo list.
> + if (task_current_donor(rq, p))
> + proxy_reset_donor(rq);
Fun fun fun :-)
> + block_task(rq, p, TASK_WAKING);
> return 0;
> }
> }
> -
> - if (p->se.sched_delayed)
> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> +out:
> if (!task_on_cpu(rq, p)) {
> /*
> * When on_rq && !on_cpu the task is preempted, see if
> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> */
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> + /*
> + * We never clear the blocked_on relation on proxy_deactivate.
> + * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
> + * when waking up. Since this is a fully blocked, off CPU task
> + * waking up, it should be safe to clear the blocked_on relation.
> + */
> + if (task_is_blocked(p))
> + clear_task_blocked_on(p, NULL);
> +
Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
skip ttwu_runnable(). We still need to clear ->blocked_on.
I am once again not sure about the lockless nature of accessing
->blocked_on.
> cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> if (task_cpu(p) != cpu) {
> if (p->in_iowait) {
> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
> switch_count = &prev->nvcsw;
> }
>
> + /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
> + if (!prev_state && task_is_blocked(prev))
> + clear_task_blocked_on(prev, NULL);
> +
This one confuses me, ttwu() should never result in ->blocked_on being
set.
> pick_again:
> assert_balance_callbacks_empty(rq);
> next = pick_next_task(rq, rq->donor, &rf);
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-03 9:52 ` Peter Zijlstra
@ 2026-04-03 10:25 ` K Prateek Nayak
2026-04-03 11:28 ` Peter Zijlstra
2026-04-03 12:54 ` Peter Zijlstra
0 siblings, 2 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-04-03 10:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello Peter,
On 4/3/2026 3:22 PM, Peter Zijlstra wrote:
>>>> + if (p->se.sched_delayed)
>>>> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>>>
>>> I can't precisely remember the details now, but I believe we need to
>>> handle enqueueing sched_delayed tasks before handling blocked_on
>>> tasks.
>>
>> So proxy_deactivate() can still delay the task leading to
>> task_on_rq_queued() and the wakeup coming to ttwu_runnable() so either
>> we can dequeue it fully in proxy_deactivate() or we need to teach
>> block_task() to add a DEQUEUE_DELAYED flag when task_is_blocked().
>>
>> I think the former is cleaner but we don't decay lag for fair tasks :-(
>>
>> We can't simply re-enqueue it either since proxy migration might have
>> put it on a CPU outside its affinity mask so we need to take a full
>> dequeue + wakeup in ttwu_runnable().
>
> Right, sanest option is to have ttwu_runnable() deal with this.
Ack! For now I've used John's original move of doing re-enqueue
before doing a dequeue if we find a delayed + blocked_on task.
>
>>>> -static void proxy_force_return(struct rq *rq, struct rq_flags *rf,
>>>> - struct task_struct *p)
>>>> - __must_hold(__rq_lockp(rq))
>>>> -{
>>>> -}
>>>> -
>>
>> Went a little heavy on the delete there, did you? :-)
>
> Well, I thought that was the whole idea, have ttwu() handle this :-)
>
>>>> /*
>>>> * Find runnable lock owner to proxy for mutex blocked donor
>>>> *
>>>> @@ -6777,7 +6723,7 @@ find_proxy_task(struct rq *rq, struct ta
>>>> clear_task_blocked_on(p, PROXY_WAKING);
>>>> return p;
>>>> }
>>>> - goto force_return;
>>>> + goto deactivate;
>>>> }
>>
>> This makes sense if we preserve the !TASK_RUNNING + p->blocked_on
>> invariant since we'll definitely get a wakeup here.
>
> Right, so TASK_RUNNING must imply !->blocked_on.
>
>>>>
>>>> /*
>>>> @@ -6812,7 +6758,7 @@ find_proxy_task(struct rq *rq, struct ta
>>>> __clear_task_blocked_on(p, NULL);
>>>> return p;
>>>> }
>>>> - goto force_return;
>>>> + goto deactivate;
>>
>> This too makes sense considering !owner implies some task will be woken
>> up but ... if we take this task off and another task steals the mutex,
>> this task will no longer be able to proxy it since it is completely
>> blocked now.
>>
>> Probably not desired. We should at least let it run and see if it can
>> get the mutex and evaluate the "p->blocked_on" again since !owner is
>> a limbo state.
>
> I need to go re-read the mutex side of things, but doesn't that do
> hand-off way more aggressively?
Ack, but we have optimistic spinning enabled for performance reasons, so
there is still a chance that the task may not get the mutex. But now that
I think about it, it will definitely receive a wakeup, so it should be
able to re-establish the chain when it gets on CPU again.
>
> Anyway, one thing that is completely missing is a fast path for when the
> task is still inside its valid mask. I suspect adding that back will
> cure some of these issues.
>
>> So I added the following on top of Peter's diff on top of
>> queue:sched/core and it hasn't crashed and burnt yet when running a
>> handful of instances of sched-messaging with a mix of fair and SCHED_RR
>> priority:
>>
>> (Includes John's findings from the parallel thread)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 5b2b2451720a..e845e3a8ae65 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2160,7 +2160,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
>> dequeue_task(rq, p, flags);
>> }
>>
>> -static bool block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
>> +static void block_task(struct rq *rq, struct task_struct *p, unsigned long task_state)
>> {
>> int flags = DEQUEUE_NOCLOCK;
>>
>> @@ -3696,6 +3696,20 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>> }
>> }
>>
>> +static void zap_balance_callbacks(struct rq *rq);
>> +
>> +static inline void proxy_reset_donor(struct rq *rq)
>> +{
>> +#ifdef CONFIG_SCHED_PROXY_EXEC
>> + WARN_ON_ONCE(rq->curr == rq->donor);
>> +
>> + put_prev_set_next_task(rq, rq->donor, rq->curr);
>> + rq_set_donor(rq, rq->curr);
>> + zap_balance_callbacks(rq);
>> + resched_curr(rq);
>> +#endif
>> +}
>
> This one hurts my brain :-)
>
>> /*
>> * Consider @p being inside a wait loop:
>> *
>> @@ -3730,6 +3744,8 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>> return 0;
>>
>> update_rq_clock(rq);
>> + if (p->se.sched_delayed)
>> + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>
> Right, this works but seems wasteful, might be better to add
> DEQUEUE_DELAYED in the blocked_on case.
>
>> if (sched_proxy_exec() && p->blocked_on) {
>
> So I had doubts about this lockless test of ->blocked_on, I still cannot
> convince myself it is correct.
Let me give it a try: A task's "blocked_on" starts off as a valid mutex and
can optionally be transitioned to PROXY_WAKING (!= NULL) before being
cleared.
If blocked_on is cleared directly, the PROXY_WAKING transition never
happens even if someone does set_task_blocked_on_waking(), since we bail
out early if !p->blocked_on.
All "p->blocked_on" transitions happen with "blocked_on_lock" held.
So that begs the question, when is "blocked_on" actually cleared?
1) If the task is task_on_rq_queued(), we either clear it in schedule()
(find_proxy_task() to be precise) or in ttwu_runnable() - both with
rq_lock held.
2) *NEW* If the task is off rq and is waking up, it means there is a
ttwu_state_match() and without proxy, the task would have woken up
and executed on the CPU.
Since the task is completely off rq, schedule() cannot clear the
p->blocked_on. The only other remote transition possible is to
PROXY_WAKING (!= NULL).
So *inspecting* the p->blocked_on relation without the
blocked_on_lock held should be fine to know if the task has a
blocked_on relation.
Only the task itself can set "p->blocked_on" to a valid mutex when
running on the CPU, so it is out of the question that we can suddenly get a
transition to a new mutex while we are in schedule() or in the middle of
waking the task.
>
>> guard(raw_spinlock)(&p->blocked_lock);
>> struct mutex *lock = p->blocked_on;
>> @@ -3738,15 +3754,20 @@ static int ttwu_runnable(struct task_struct *p, int wake_flags)
>> * TASK_WAKING is a special state and results in
>> * DEQUEUE_SPECIAL such that the task will actually be
>> * forced from the runqueue.
>> + *
>> + * XXX: All of this is now equivalent of
>> + * proxy_needs_return() from John's series :-)
>> */
>> - block_task(rq, p, TASK_WAKING);
>> p->blocked_on = NULL;
>> + if (task_current(rq, p))
>> + goto out;
>
> Right, fair enough :-) This could also be done when rq->cpu is inside
> p->cpus_ptr mask, because in that case we don't strictly need a
> migration. Thinking about that was on the todo list.
Ack. One concern there is that the task was out of the load balancer's
purview while it had "p->blocked_on" set, and this could be a good
spot for balancing during wakeup.
>
>> + if (task_current_donor(rq, p))
>> + proxy_reset_donor(rq);
>
> Fun fun fun :-)
>
>> + block_task(rq, p, TASK_WAKING);
>> return 0;
>> }
>> }
>> -
>> - if (p->se.sched_delayed)
>> - enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
>> +out:
>> if (!task_on_cpu(rq, p)) {
>> /*
>> * When on_rq && !on_cpu the task is preempted, see if
>> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>> */
>> smp_cond_load_acquire(&p->on_cpu, !VAL);
>>
>> + /*
>> + * We never clear the blocked_on relation on proxy_deactivate.
>> + * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
>> + * when waking up. Since this is a fully blocked, off CPU task
>> + * waking up, it should be safe to clear the blocked_on relation.
>> + */
>> + if (task_is_blocked(p))
>> + clear_task_blocked_on(p, NULL);
>> +
>
> Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
> skip ttwu_runnable(). We still need to clear ->blocked_on.
>
> I am once again not sure on the lockless nature of accessing
> ->blocked_on.
I hope I have convinced you from the short analysis above :-)
>
>> cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
>> if (task_cpu(p) != cpu) {
>> if (p->in_iowait) {
>> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
>> switch_count = &prev->nvcsw;
>> }
>>
>> + /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
>> + if (!prev_state && task_is_blocked(prev))
>> + clear_task_blocked_on(prev, NULL);
>> +
>
> This one confuses me, ttwu() should never result in ->blocked_on being
> set.
This is from the signal_pending_state() in try_to_block_task() putting
prev to TASK_RUNNING while still having p->blocked_on.
It is expected that the task executes and re-evaluates whether it needs to
block on the mutex again or simply return -EINTR from
mutex_lock_interruptible().
>
>> pick_again:
>> assert_balance_callbacks_empty(rq);
>> next = pick_next_task(rq, rq->donor, &rf);
--
Thanks and Regards,
Prateek
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-03 10:25 ` K Prateek Nayak
@ 2026-04-03 11:28 ` Peter Zijlstra
2026-04-03 13:43 ` K Prateek Nayak
2026-04-03 12:54 ` Peter Zijlstra
1 sibling, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-04-03 11:28 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Apr 03, 2026 at 03:55:22PM +0530, K Prateek Nayak wrote:
> >> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >> */
> >> smp_cond_load_acquire(&p->on_cpu, !VAL);
> >>
> >> + /*
> >> + * We never clear the blocked_on relation on proxy_deactivate.
> >> + * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
> >> + * when waking up. Since this is a fully blocked, off CPU task
> >> + * waking up, it should be safe to clear the blocked_on relation.
> >> + */
> >> + if (task_is_blocked(p))
> >> + clear_task_blocked_on(p, NULL);
> >> +
> >
> > Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
> > skip ttwu_runnable(). We still need to clear ->blocked_on.
I wonder, should we have proxy_deactivate() do this instead?
> >
> >> cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> >> if (task_cpu(p) != cpu) {
> >> if (p->in_iowait) {
> >> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
> >> switch_count = &prev->nvcsw;
> >> }
> >>
> >> + /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
> >> + if (!prev_state && task_is_blocked(prev))
> >> + clear_task_blocked_on(prev, NULL);
> >> +
> >
> > This one confuses me, ttwu() should never result in ->blocked_on being
> > set.
>
> This is from the signal_pending_state() in try_to_block_task() putting
> prev to TASK_RUNNING while still having p->blocked_on.
Ooh, I misread that. I saw that set_task_blocked_on_waking(, NULL) in
there and thought it would clear. Damn, sometimes reading is so very
hard...
Anyhow, with my changes on, try_to_block_task() is only ever called from
__schedule() on current. This means this could in fact be
clear_task_blocked_on(), right?
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-03 11:28 ` Peter Zijlstra
@ 2026-04-03 13:43 ` K Prateek Nayak
2026-04-03 14:38 ` Peter Zijlstra
0 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-04-03 13:43 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello Peter,
On 4/3/2026 4:58 PM, Peter Zijlstra wrote:
> On Fri, Apr 03, 2026 at 03:55:22PM +0530, K Prateek Nayak wrote:
>>>> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>>> */
>>>> smp_cond_load_acquire(&p->on_cpu, !VAL);
>>>>
>>>> + /*
>>>> + * We never clear the blocked_on relation on proxy_deactivate.
>>>> + * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
>>>> + * when waking up. Since this is a fully blocked, off CPU task
>>>> + * waking up, it should be safe to clear the blocked_on relation.
>>>> + */
>>>> + if (task_is_blocked(p))
>>>> + clear_task_blocked_on(p, NULL);
>>>> +
>>>
>>> Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
>>> skip ttwu_runnable(). We still need to clear ->blocked_on.
>
> I wonder, should we have proxy_deactivate() do this instead?
That is one way to tackle that, yes!
>
>>>
>>>> cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
>>>> if (task_cpu(p) != cpu) {
>>>> if (p->in_iowait) {
>>>> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
>>>> switch_count = &prev->nvcsw;
>>>> }
>>>>
>>>> + /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
>>>> + if (!prev_state && task_is_blocked(prev))
>>>> + clear_task_blocked_on(prev, NULL);
>>>> +
>>>
>>> This one confuses me, ttwu() should never result in ->blocked_on being
>>> set.
>>
>> This is from the signal_pending_state() in try_to_block_task() putting
>> prev to TASK_RUNNING while still having p->blocked_on.
>
> Ooh, I misread that. I saw that set_task_blocked_on_waking(, NULL) in
> there and thought it would clear. Damn, sometimes reading is so very
> hard...
>
> Anyhow, with my changes on, try_to_block_task() is only ever called from
> __schedule() on current. This means this could in fact be
> clear_task_blocked_on(), right?
Yes, that should work too, but there is also the case you mentioned on
the parallel thread - if ttwu() doesn't see p->blocked_on, it won't
clear it, and if ttwu_runnable() wins we can simply go into
__schedule() with TASK_RUNNING + p->blocked_on with no guarantee that
the same task will be picked again. Then the task gets preempted and stays
in an illegal state.
Clearing of ->blocked_on in schedule also safeguards against that race:
o Scenario 1: ttwu_runnable() wins
CPU0 CPU1
LOCK ACQUIRE
[W] ->blocked_on = lock [R] ->__state
[W] ->__state = state; RMB
[R] ->blocked_on
[W] ->__state = RUNNING
/* Synchronized by rq_lock */
MB
[R] if (!->__state &&
[R] ->blocked_on)
[W] ->blocked_on = NULL; /* Safeguard */
o Scenario 2: __schedule() wins
CPU0 CPU1
LOCK ACQUIRE
[W] ->blocked_on = lock [R] ->__state
[W] ->__state = state;
/* Synchronized by rq_lock */
MB
[W] __block_task()
MB
/* Full wakeup. */
[R] if (->blocked_on)
[W] ->blocked_on = NULL
--
Thanks and Regards,
Prateek
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-03 13:43 ` K Prateek Nayak
@ 2026-04-03 14:38 ` Peter Zijlstra
2026-04-03 15:39 ` K Prateek Nayak
0 siblings, 1 reply; 56+ messages in thread
From: Peter Zijlstra @ 2026-04-03 14:38 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Apr 03, 2026 at 07:13:29PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> On 4/3/2026 4:58 PM, Peter Zijlstra wrote:
> > On Fri, Apr 03, 2026 at 03:55:22PM +0530, K Prateek Nayak wrote:
> >>>> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >>>> */
> >>>> smp_cond_load_acquire(&p->on_cpu, !VAL);
> >>>>
> >>>> + /*
> >>>> + * We never clear the blocked_on relation on proxy_deactivate.
> >>>> + * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
> >>>> + * when waking up. Since this is a fully blocked, off CPU task
> >>>> + * waking up, it should be safe to clear the blocked_on relation.
> >>>> + */
> >>>> + if (task_is_blocked(p))
> >>>> + clear_task_blocked_on(p, NULL);
> >>>> +
> >>>
> >>> Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
> >>> skip ttwu_runnable(). We still need to clear ->blocked_on.
> >
> > I wonder, should we have proxy_deactivate() do this instead?
>
> That is one way to tackle that, yes!
OK, let's put it there. At that site we already know task_is_blocked()
and we get less noise in the wakeup path.
Or should we perhaps put it in block_task() itself? The moment you're
off the runqueue, ->blocked_on becomes meaningless.
> >>>
> >>>> cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
> >>>> if (task_cpu(p) != cpu) {
> >>>> if (p->in_iowait) {
> >>>> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
> >>>> switch_count = &prev->nvcsw;
> >>>> }
> >>>>
> >>>> + /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
> >>>> + if (!prev_state && task_is_blocked(prev))
> >>>> + clear_task_blocked_on(prev, NULL);
> >>>> +
> >>>
> >>> This one confuses me, ttwu() should never result in ->blocked_on being
> >>> set.
> >>
> >> This is from the signal_pending_state() in try_to_block_task() putting
> >> prev to TASK_RUNNING while still having p->blocked_on.
> >
> > Ooh, I misread that. I saw that set_task_blocked_on_waking(, NULL) in
> > there and thought it would clear. Damn, sometimes reading is so very
> > hard...
> >
> > Anyhow, with my changes on, try_to_block_task() is only ever called from
> > __schedule() on current. This means this could in fact be
> > clear_task_blocked_on(), right?
>
> Yes, that should work too but there is also the case you mentioned on
> the parallel thread - if ttwu() doesn't see p->blocked_on, it won't
> clear it and if ttwu_runnable() wins we can simply go into
> __schedule() with TASK_RUNNING + p->blocked_on with no guarantee that
> the same task will be picked again. Then the task gets preempted and stays
> in an illegal state.
>
> Clearing of ->blocked_on in schedule also safeguards against that race:
>
> o Scenario 1: ttwu_runnable() wins
>
> CPU0 CPU1
>
> LOCK ACQUIRE
> [W] ->blocked_on = lock [R] ->__state
> [W] ->__state = state; RMB
> [R] ->blocked_on
> [W] ->__state = RUNNING
>
> /* Synchronized by rq_lock */
>
> MB
> [R] if (!->__state &&
> [R] ->blocked_on)
> [W] ->blocked_on = NULL; /* Safeguard */
>
>
> o Scenario 2: __schedule() wins
>
> CPU0 CPU1
>
> LOCK ACQUIRE
> [W] ->blocked_on = lock [R] ->__state
> [W] ->__state = state;
>
> /* Synchronized by rq_lock */
> MB
> [W] __block_task()
>
> MB
> /* Full wakeup. */
> [R] if (->blocked_on)
> [W] ->blocked_on = NULL
>
Oh Boohoo :-( Yes, you're quite right.
It is find_proxy_task() that is affected, so this fixup should
perhaps live in the existing sched_proxy_exec() branch?
Something like so?
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2222,6 +2222,13 @@ static bool block_task(struct rq *rq, st
{
int flags = DEQUEUE_NOCLOCK;
+ /*
+ * We're being taken off the runqueue, cannot still be blocked_on
+ * anything. This also means that delay_dequeue can not have
+ * blocked_on.
+ */
+ clear_task_blocked_on(p, NULL);
+
p->sched_contributes_to_load =
(task_state & TASK_UNINTERRUPTIBLE) &&
!(task_state & TASK_NOLOAD) &&
@@ -6614,7 +6621,7 @@ static bool try_to_block_task(struct rq
if (signal_pending_state(task_state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
*task_state_p = TASK_RUNNING;
- set_task_blocked_on_waking(p, NULL);
+ clear_task_blocked_on(p, NULL);
return false;
}
@@ -7043,6 +7050,14 @@ static void __sched notrace __schedule(i
if (sched_proxy_exec()) {
struct task_struct *prev_donor = rq->donor;
+ /*
+ * There is a race between ttwu() and __mutex_lock_common()
+ * where it is possible for the mutex code to call into
+ * schedule() with ->blocked_on still set.
+ */
+ if (!prev_state && prev->blocked_on)
+ clear_task_blocked_on(prev, NULL);
+
rq_set_donor(rq, next);
if (unlikely(next->blocked_on)) {
next = find_proxy_task(rq, next, &rf);
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-03 14:38 ` Peter Zijlstra
@ 2026-04-03 15:39 ` K Prateek Nayak
2026-04-03 21:08 ` Peter Zijlstra
2026-04-04 0:26 ` John Stultz
0 siblings, 2 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-04-03 15:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello Peter,
On 4/3/2026 8:08 PM, Peter Zijlstra wrote:
> On Fri, Apr 03, 2026 at 07:13:29PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> On 4/3/2026 4:58 PM, Peter Zijlstra wrote:
>>> On Fri, Apr 03, 2026 at 03:55:22PM +0530, K Prateek Nayak wrote:
>>>>>> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>>>>>> */
>>>>>> smp_cond_load_acquire(&p->on_cpu, !VAL);
>>>>>>
>>>>>> + /*
>>>>>> + * We never clear the blocked_on relation on proxy_deactivate.
>>>>>> + * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
>>>>>> + * when waking up. Since this is a fully blocked, off CPU task
>>>>>> + * waking up, it should be safe to clear the blocked_on relation.
>>>>>> + */
>>>>>> + if (task_is_blocked(p))
>>>>>> + clear_task_blocked_on(p, NULL);
>>>>>> +
>>>>>
>>>>> Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
>>>>> skip ttwu_runnable(). We still need to clear ->blocked_on.
>>>
>>> I wonder, should we have proxy_deactivate() do this instead?
>>
>> That is one way to tackle that, yes!
>
> OK, lets put it there. At that site we already know task_is_blocked()
> and we get less noise in the wakeup path.
>
> Or should we perhaps put it in block_task() itself? The moment you're
> off the runqueue, ->blocked_on becomes meaningless.
Ack, but I'll have to point you to these next bits in John's tree that
handle the sleeping owner case:
https://github.com/johnstultz-work/linux-dev/commit/255c9e933edf5b86e29f9fbde67738fc5041a862
Essentially, going further, when the blocked_on chain encounters a
blocked owner, they'll block themselves and attach onto the sleeping
owner - when the owner wakes up, the whole chain is activated in one go
restoring proxy.
This is why John has suggested that block_task() is probably not the
right place to clear it since, for the sleeping owner bits, we need to
preserve the blocked_on relation until ttwu().
I have some ideas but let me first see if I can stop them from
exploding on my system :-)
>
>>>>>
>>>>>> cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
>>>>>> if (task_cpu(p) != cpu) {
>>>>>> if (p->in_iowait) {
>>>>>> @@ -6977,6 +7007,10 @@ static void __sched notrace __schedule(int sched_mode)
>>>>>> switch_count = &prev->nvcsw;
>>>>>> }
>>>>>>
>>>>>> + /* See: https://github.com/kudureranganath/linux/commit/0d6a01bb19db39f045d6f0f5fb4d196500091637 */
>>>>>> + if (!prev_state && task_is_blocked(prev))
>>>>>> + clear_task_blocked_on(prev, NULL);
>>>>>> +
>>>>>
>>>>> This one confuses me, ttwu() should never result in ->blocked_on being
>>>>> set.
>>>>
>>>> This is from the signal_pending_state() in try_to_block_task() putting
>>>> prev to TASK_RUNNING while still having p->blocked_on.
>>>
>>> Ooh, I misread that. I saw that set_task_blocked_on_waking(, NULL) in
>>> there and thought it would clear. Damn, sometimes reading is so very
>>> hard...
>>>
>>> Anyhow, with my changes on, try_to_block_task() is only ever called from
>>> __schedule() on current. This means this could in fact be
>>> clear_task_blocked_on(), right?
>>
>> Yes, that should work too but there is also the case you mentioned on
>> the parallel thread - if ttwu() doesn't see p->blocked_on, it won't
>> clear it and if ttwu_runnable() wins we can simply go into
>> __schedule() with TASK_RUNNING + p->blocked_on with no guarantee that
>> same task will be picked again. Then the task gets preempted and stays
>> in an illegal state.
>>
>> Clearing of ->blocked_on in schedule also safeguards against that race:
>>
>> o Scenario 1: ttwu_runnable() wins
>>
>>    CPU0                        CPU1
>>
>>    LOCK                        ACQUIRE
>>    [W] ->blocked_on = lock     [R] ->__state
>>    [W] ->__state = state;      RMB
>>                                [R] ->blocked_on
>>                                [W] ->__state = RUNNING
>>
>>          /* Synchronized by rq_lock */
>>
>>    MB
>>    [R] if (!->__state &&
>>    [R]     ->blocked_on)
>>    [W]       ->blocked_on = NULL; /* Safeguard */
>>
>>
>> o Scenario 2: __schedule() wins
>>
>>    CPU0                        CPU1
>>
>>    LOCK                        ACQUIRE
>>    [W] ->blocked_on = lock     [R] ->__state
>>    [W] ->__state = state;
>>
>>          /* Synchronized by rq_lock */
>>
>>    MB
>>    [W] __block_task()
>>
>>                                MB
>>                                /* Full wakeup. */
>>                                [R] if (->blocked_on)
>>                                [W]     ->blocked_on = NULL
>>
>
> Oh Boohoo :-( Yes, you're quite right.
>
> It is find_proxy_task() that is affected, so this fixup should
> perhaps live in the existing sched_proxy_exec() branch?
>
> Something like so?
>
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2222,6 +2222,13 @@ static bool block_task(struct rq *rq, st
> {
> int flags = DEQUEUE_NOCLOCK;
>
> + /*
> + * We're being taken off the runqueue, cannot still be blocked_on
> + * anything. This also means that delay_dequeue can not have
> + * blocked_on.
> + */
> + clear_task_blocked_on(p, NULL);
> +
Oh! I remember why we couldn't do this - if we clear the
"blocked_on" via proxy_deactivate(), we might be delayed on
the owner's CPU *outside* of the task's affinity.
In that case, we don't want to re-enqueue it there since it'll
violate affinity and the "blocked_on" relation forces it down
the sched_proxy_exec() path in ttwu_runnable() which will fix
it via a full wakeup.
> p->sched_contributes_to_load =
> (task_state & TASK_UNINTERRUPTIBLE) &&
> !(task_state & TASK_NOLOAD) &&
> @@ -6614,7 +6621,7 @@ static bool try_to_block_task(struct rq
> if (signal_pending_state(task_state, p)) {
> WRITE_ONCE(p->__state, TASK_RUNNING);
> *task_state_p = TASK_RUNNING;
> - set_task_blocked_on_waking(p, NULL);
> + clear_task_blocked_on(p, NULL);
This is not strictly required - since we set the "*task_state_p" to
TASK_RUNNING which modifies "prev_state" in __schedule() ...
>
> return false;
> }
> @@ -7043,6 +7050,14 @@ static void __sched notrace __schedule(i
> if (sched_proxy_exec()) {
> struct task_struct *prev_donor = rq->donor;
>
> + /*
> + * There is a race between ttwu() and __mutex_lock_common()
> + * where it is possible for the mutex code to call into
> + * schedule() with ->blocked_on still set.
> + */
> + if (!prev_state && prev->blocked_on)
> + clear_task_blocked_on(prev, NULL);
... we'll just end up seeing !prev_state here and clearing it.
We can keep them separate too to make it very explicit. No
strong feelings.
> +
> rq_set_donor(rq, next);
> if (unlikely(next->blocked_on)) {
> next = find_proxy_task(rq, next, &rf);
>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 56+ messages in thread

* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-03 15:39 ` K Prateek Nayak
@ 2026-04-03 21:08 ` Peter Zijlstra
2026-04-04 0:26 ` John Stultz
1 sibling, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2026-04-03 21:08 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Apr 03, 2026 at 09:09:31PM +0530, K Prateek Nayak wrote:
> > + /*
> > + * We're being taken off the runqueue, cannot still be blocked_on
> > + * anything. This also means that delay_dequeue can not have
> > + * blocked_on.
> > + */
> > + clear_task_blocked_on(p, NULL);
> > +
>
> Oh! I remember why we couldn't do this - if we clear the
> "blocked_on" via proxy_deactivate(), we might be delayed on
> the owner's CPU *outside* of the task's affinity.
>
> In that case, we don't want to re-enqueue it there since it'll
> violate affinity and the "blocked_on" relation forces it down
> the sched_proxy_exec() path in ttwu_runnable() which will fix
> it via a full wakeup.
Ah indeed.
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-03 15:39 ` K Prateek Nayak
2026-04-03 21:08 ` Peter Zijlstra
@ 2026-04-04 0:26 ` John Stultz
2026-04-04 5:49 ` K Prateek Nayak
1 sibling, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-04-04 0:26 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Peter Zijlstra, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Apr 3, 2026 at 8:39 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 4/3/2026 8:08 PM, Peter Zijlstra wrote:
> > On Fri, Apr 03, 2026 at 07:13:29PM +0530, K Prateek Nayak wrote:
> >> Hello Peter,
> >>
> >> On 4/3/2026 4:58 PM, Peter Zijlstra wrote:
> >>> On Fri, Apr 03, 2026 at 03:55:22PM +0530, K Prateek Nayak wrote:
> >>>>>> @@ -4256,6 +4277,15 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >>>>>> */
> >>>>>> smp_cond_load_acquire(&p->on_cpu, !VAL);
> >>>>>>
> >>>>>> + /*
> >>>>>> + * We never clear the blocked_on relation on proxy_deactivate.
> >>>>>> + * If we don't clear it here, we have TASK_RUNNING + p->blocked_on
> >>>>>> + * when waking up. Since this is a fully blocked, off CPU task
> >>>>>> + * waking up, it should be safe to clear the blocked_on relation.
> >>>>>> + */
> >>>>>> + if (task_is_blocked(p))
> >>>>>> + clear_task_blocked_on(p, NULL);
> >>>>>> +
> >>>>>
> >>>>> Aah, yes! This is when find_proxy_task() hits deactivate() for us and we
> >>>>> skip ttwu_runnable(). We still need to clear ->blocked_on.
> >>>
> >>> I wonder, should we have proxy_deactivate() do this instead?
> >>
> >> That is one way to tackle that, yes!
> >
> > OK, lets put it there. At that site we already know task_is_blocked()
> > and we get less noise in the wakeup path.
> >
> > Or should we perhaps put it in block_task() itself? The moment you're
> > off the runqueue, ->blocked_on becomes meaningless.
>
> Ack but I'll have to point you to these next bits in John's tree that
> handles sleeping owner
> https://github.com/johnstultz-work/linux-dev/commit/255c9e933edf5b86e29f9fbde67738fc5041a862
>
> Essentially, going further, when the blocked_on chain encounters a
> blocked owner, they'll block themselves and attach onto the sleeping
> owner - when the owner wakes up, the whole chain is activated in one go
> restoring proxy.
>
> This is why John has suggested that block_task() is probably not the
> right place to clear it since, for the sleeping owner bits, we need to
> preserve the blocked_on relation until ttwu().
>
> I have some ideas but let me first see if I can stop them from
> exploding on my system :-)
Phew, you two are hard to keep up with. :) I really wanted to get my
v27 set out last night, but then got derailed by the (not
proxy-related) dl_server issue I was seeing in testing.
Anyway, I'd still like to get it out soon, but now I'd really like to
have the approach here included, so...
I'm currently testing with my best guess of the combined suggestions
you've both tossed into this thread. Unfortunately I still trip over
state == TASK_RUNNING in proxy_deactivate(), so I'm trying to debug
that now.
(I think the issue is we hit the ttwu_queue_wakelist() case without
clearing PROXY_WAKING, so we need to clear_task_blocked_on() earlier
in ttwu, likely right after setting TASK_WAKING - that's looking ok so
far).
After I get this into a stable state, I'll try to polish it up a bit
and then re-layer the rest of the proxy-exec series on top (I do fret
the sleeping owner enqueueing will be more complicated).
-john
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-04 0:26 ` John Stultz
@ 2026-04-04 5:49 ` K Prateek Nayak
2026-04-04 6:07 ` John Stultz
0 siblings, 1 reply; 56+ messages in thread
From: K Prateek Nayak @ 2026-04-04 5:49 UTC (permalink / raw)
To: John Stultz
Cc: Peter Zijlstra, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 4/4/2026 5:56 AM, John Stultz wrote:
> I'm currently testing with my best guess of the combined suggestions
> you've both tossed into this thread. Unfortunately I still trip over
> state == TASK_RUNNING in proxy_deactivate(), so I'm trying to debug
> that now.
> (I think the issue is we hit the ttwu_queue_wakelist() case without
> clearing PROXY_WAKING, so we need to clear_task_blocked_on() earlier
> in ttwu, likely right after setting TASK_WAKING - that's looking ok so
> far).
That makes sense! I forgot we had that path for "p->on_cpu". Thank you
for chasing that.
Also looking at proxy-exec-v27-WIP, I think that:
set_task_blocked_on_waking(p, NULL)
in try_to_wake_up() can be moved after ttwu_state_match() since
otherwise, we skip waking the task but ttwu() will end up marking
it PROXY_WAKING for a spurious wakeup.
I don't think there is any harm in keeping it that way but during the
next pick, __schedule() will end up seeing PROXY_WAKING and will block
the task.
Until it receives a genuine wakeup, it cannot participate in proxy which
is a missed opportunity.
>
> After I get this into a stable state, I'll try to polish it up a bit
> and then re-layer the rest of the proxy-exec series on top (I do fret
> the sleeping owner enqueueing will be more complicated).
Ack! Let me go stare at it for a while but since you say it hasn't
outright crashed your system, you may have taken care of everything
already ;-)
--
Thanks and Regards,
Prateek
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-04 5:49 ` K Prateek Nayak
@ 2026-04-04 6:07 ` John Stultz
2026-04-06 2:40 ` K Prateek Nayak
0 siblings, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-04-04 6:07 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Peter Zijlstra, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Apr 3, 2026 at 10:49 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John,
>
> On 4/4/2026 5:56 AM, John Stultz wrote:
> > I'm currently testing with my best guess of the combined suggestions
> > you've both tossed into this thread. Unfortunately I still trip over
> > state == TASK_RUNNING in proxy_deactivate(), so I'm trying to debug
> > that now.
> > (I think the issue is we hit the ttwu_queue_wakelist() case without
> > clearing PROXY_WAKING, so we need to clear_task_blocked_on() earlier
> > in ttwu, likely right after setting TASK_WAKING - that's looking ok so
> > far).
>
> That makes sense! I forgot we had that path for "p->on_cpu". Thank you
> for chasing that.
>
> Also looking at proxy-exec-v27-WIP, I think that:
>
> set_task_blocked_on_waking(p, NULL)
>
> in try_to_wake_up() can be moved after ttwu_state_match() since
> otherwise, we skip waking the task but ttwu() will end up marking
> it PROXY_WAKING for a spurious wakeup.
>
> I don't think there is any harm in keeping it that way but during the
> next pick, __schedule() will end up seeing PROXY_WAKING and will block
> the task.
>
> Until it receives a genuine wakeup, it cannot participate in proxy which
> is a missed opportunity.
So, actually, that caught my eye as well, and I think it can be
dropped completely. I just didn't have the time to experiment and
respin the series when I noticed it.
My thinking:
1) Since including your recent suggestions, proxy_needs_return()
handles both cases where its blocked_on a mutex or PROXY_WAKING.
2) If it's not on the runqueue, we'll clear blocked_on anyway after
setting TASK_WAKING
So it seems like it's harmless, but also unnecessary. I'll make sure
next week and will drop it in the next revision.
> > After I get this into a stable state, I'll try to polish it up a bit
> > and then re-layer the rest of the proxy-exec series on top (I do fret
> > the sleeping owner enqueueing will be more complicated).
>
> Ack! Let me go stare at it for a while but since you say it hasn't
> outright crashed your system, you may have taken care of everything
> already ;-)
I'll leave some testing running over the weekend to see if I catch anything.
Next week I'll do some trace analysis to make sure blocked owner
enqueuing is really behaving properly, which is my main concern right
now. Also need to get a sense of the performance situation to make
sure we're not regressing any further.
thanks
-john
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-04 6:07 ` John Stultz
@ 2026-04-06 2:40 ` K Prateek Nayak
0 siblings, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-04-06 2:40 UTC (permalink / raw)
To: John Stultz
Cc: Peter Zijlstra, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 4/4/2026 11:37 AM, John Stultz wrote:
>> Also looking at proxy-exec-v27-WIP, I think that:
>>
>> set_task_blocked_on_waking(p, NULL)
>>
>> in try_to_wake_up() can be moved after ttwu_state_match() since
>> otherwise, we skip waking the task but ttwu() will end up marking
>> it PROXY_WAKING for a spurious wakeup.
>>
>> I don't think there is any harm in keeping it that way but during the
>> next pick, __schedule() will end up seeing PROXY_WAKING and will block
>> the task.
>>
>> Until it receives a genuine wakeup, it cannot participate in proxy which
>> is a missed opportunity.
>
> So, actually, that caught my eye as well, and I think it can be
> dropped completely. I just didn't have the time to experiment and
> respin the series when I noticed it.
>
> My thinking:
> 1) Since including your recent suggestions, proxy_needs_return()
> handles both cases where its blocked_on a mutex or PROXY_WAKING.
> 2) If its not on the runqueue, we'll clear blocked_on anyway after
> setting TASK_WAKING
>
> So it seems like its harmless, but also unnecessary. I'll make sure
> next week and will drop it in the next revision.
Ack! I just thought you wanted to keep a clean:
blocked_on: NULL -> MUTEX -> PROXY_WAKING -> NULL
transition in all cases ;-)
--
Thanks and Regards,
Prateek
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-03 10:25 ` K Prateek Nayak
2026-04-03 11:28 ` Peter Zijlstra
@ 2026-04-03 12:54 ` Peter Zijlstra
1 sibling, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2026-04-03 12:54 UTC (permalink / raw)
To: K Prateek Nayak
Cc: John Stultz, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Apr 03, 2026 at 03:55:22PM +0530, K Prateek Nayak wrote:
> >> if (sched_proxy_exec() && p->blocked_on) {
> >
> > So I had doubts about this lockless test of ->blocked_on, I still cannot
> > convince myself it is correct.
>
> Let me give a try: A task's "blocked_on" starts off as a valid mutex and
> can be transitioned optionally to PROXY_WAKING (!= NULL) before being
> cleared.
>
> If blocked_on is cleared directly, PROXY_WAKING transition never
> happens even if someone does set_task_blocked_on_waking() since we bail
> out early if !p->blocked_on.
>
> All "p->blocked_on" transitions happen with the "blocked_on_lock" held.
>
> So that begs the question, when is "blocked_on" actually cleared?
>
> 1) If the task is task_on_rq_queued(), we either clear it in schedule()
> (find_proxy_task() to be precise) or in ttwu_runnable() - both with
> rq_lock held.
>
> 2) *NEW* If the task is off rq and is waking up, it means there is a
> ttwu_state_match() and without proxy, the task would have woken up
> and executed on the CPU.
>
> Since the task is completely off rq, schedule() cannot clear the
> p->blocked_on. Only other remote transition possible is to
> PROXY_WAKING (!= NULL).
>
> So *inspecting* the p->blocked_on relation without the
> blocked_on_lock held should be fine to know if the task has a
> blocked_on relation.
>
> Only the task itself can set "p->blocked_on" to a valid mutex when
> running on the CPU so it is out of question we can suddenly get a
> transition to a new mutex when we are in schedule() or in middle of
> waking the task.
So my consideration was:
__mutex_lock_common()
    ...
    raw_spin_lock(&current->blocked_lock);
    __set_task_blocked_on(current, lock)
        current->blocked_on = lock;
    set_current_state(state)
        current->__state = state;
        smp_mb();
This means we have:
LOCK
[W] ->blocked_on = lock
[W] ->__state = state;
MB
Then consider:
try_to_wake_up()
    ...
    raw_spin_lock_irqsave(&p->lock);
    if (ttwu_state_match(p, state, &success))
        ...
    smp_rmb();
    if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
        if (sched_proxy_exec() && p->blocked_on)
This is effectively:
ACQUIRE
[R] ->__state
RMB
[R] ->blocked_on
Combined this gives:
CPU0                        CPU1

LOCK                        ACQUIRE
[W] ->blocked_on = lock     [R] ->__state
[W] ->__state = state;      RMB
MB                          [R] ->blocked_on
And that is *NOT* properly ordered. It is possible to observe [W]
__state and pass ttwu_state_match() and NOT observe [W] ->blocked_on and
see !->blocked_on.
(on weakly ordered machines, obviously)
So that does a ttwu() but will 'retain' ->blocked_on -- which violates
the model. Which is about where I got.
That said; this race, while valid, doesn't actually harm. Because as you
say, this means that CPU1 is in the middle of mutex_lock() and will
observe the wakeup and cancel the block and clean up ->blocked_on
itself.
So yeah, I think we're good.
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-04-02 18:31 ` John Stultz
2026-04-02 21:04 ` John Stultz
2026-04-03 6:09 ` K Prateek Nayak
@ 2026-04-03 9:18 ` Peter Zijlstra
2 siblings, 0 replies; 56+ messages in thread
From: Peter Zijlstra @ 2026-04-03 9:18 UTC (permalink / raw)
To: John Stultz
Cc: K Prateek Nayak, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Thu, Apr 02, 2026 at 11:31:56AM -0700, John Stultz wrote:
> So I like getting rid of proxy_force_return(), but its not clear to me
> that proxy_deactivate() is what we want to do in these
> find_proxy_task() edge cases.
>
> It feels like if we are already racing with ttwu, deactivating the
> task seems like it might open more windows where we might lose the
> wakeup.
>
> In fact, the whole reason we have proxy_force_return() is that earlier
> in the proxy-exec development, when we hit those edge cases we usually
> would return proxy_reschedule_idle() just to drop the rq lock and let
> ttwu do its thing, but there kept on being cases where we would end up
> with lost wakeups.
>
> But I'll give this a shot (and will integrate your ttwu_runnable
> cleanups regardless) and see how it does.
So the main idea is that ttwu() will be in charge of migrating back, as
the one and only means of doing so.
This includes signals and unlock and everything.
This means that there are two main cases:
- ttwu() happens first and finds the task on_rq; we hit
ttwu_runnable().
- schedule() happens first and hits this task without means of going
forward.
Lets do the second first; this is handled by doing dequeue. It must take
the task off the runqueue, so it can select another task and make
progress. But this had me hit those proxy_deactivate() failure cases,
those must not exist.
The first is that deactivate can encounter TASK_RUNNING, this must not
be, because TASK_RUNNING would mean ttwu() has happened and that would
then have sorted everything out.
The second is that signal case, which again should not happen, because
the signal ttwu() should sort it all out. We just want to take the task
off the runqueue here.
Now the ttwu() case. So if it is first it will hit ttwu_runnable(), but
we don't want this case. So instead we dequeue the task and say: 'nope,
wasn't on_rq', which proceeds into the 'normal' wakeup path which does a
migration.
And note, that if proxy_deactivate() happened first, we simply skip that
first step and directly go into the normal wakeup path.
There is no worry about ttwu() going missing, ttwu() is changed to make
sure any ->TASK_RUNNING transition ensures ->blocked_on gets cleared and
the task ends up on a suitable CPU.
Anyway, that is the high level idea; like I said, I didn't get around to
doing all the details (and I clearly missed a few :-).
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-27 11:48 ` Peter Zijlstra
2026-03-27 13:33 ` K Prateek Nayak
@ 2026-03-27 19:15 ` John Stultz
1 sibling, 0 replies; 56+ messages in thread
From: John Stultz @ 2026-03-27 19:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: K Prateek Nayak, LKML, Joel Fernandes, Qais Yousef, Ingo Molnar,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Fri, Mar 27, 2026 at 4:48 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Anyway, you seem to want to drive the return migration from the regular
> wakeup path and I don't mind doing that, provided it isn't horrible. But
> we can do this on top of these patches, right?
>
> That is, I'm thinking of taking these patches, they're in reasonable
> shape, and John deserves a little progress :-)
>
> I did find myself liking the below a little better, but I'll just sneak
> that in.
I was expecting to respin with some of Prateek's and Steven's feedback
(and include your suggested switch to using goto to get out of the
locking scope), but I'd totally not object to you taking this set
(with whatever tweaks you'd prefer).
Do let me know when you can share your queue and I'll rebase and
rework the rest of the series along with any un-integrated feedback to
this set.
thanks
-john
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-25 10:52 ` [PATCH v26 00/10] Simple Donor Migration for Proxy Execution K Prateek Nayak
2026-03-27 11:48 ` Peter Zijlstra
@ 2026-03-27 19:10 ` John Stultz
2026-03-28 4:53 ` K Prateek Nayak
1 sibling, 1 reply; 56+ messages in thread
From: John Stultz @ 2026-03-27 19:10 UTC (permalink / raw)
To: K Prateek Nayak
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
On Wed, Mar 25, 2026 at 3:52 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> On 3/25/2026 12:43 AM, John Stultz wrote:
> > There’s also been some further improvements In the full Proxy
> > Execution series:
> > * Tweaks to proxy_needs_return() suggested by K Prateek
>
> To answer your question on v25, I finally seem to have
> ttwu_state_match() happy with the pieces in:
> https://github.com/kudureranganath/linux/commits/kudure/sched/proxy/ttwu_state_match/
>
> The base rationale is still the same from
> https://lore.kernel.org/lkml/eccf9bb5-8455-48e5-aa35-4878c25f6822@amd.com/
So thank you so much for sharing this tree! It's definitely helpful
and better shows how to split up the larger proposal you had.
I've been occupied chasing the null __pick_eevdf() return issue (which
I've now tripped without my proxy changes, so it's an upstream thing
but I'd still like to bisect it down), along with other items, so I've
not yet been able to fully ingest your changes. I did run some testing
on them and didn't see any immediate issues (other than the null
__pick_eevdf() issue, which limits the testing time to ~4 hours), and
I even ran it along with the sleeping owner enqueuing change on top
which had been giving me grief in earlier attempts to integrate these
suggestions. So that's good!
My initial/brief reactions looking through your series:
* sched/core: Clear "blocked_on" relation if schedule races with wakeup
At first glance, this makes me feel nervous because clearing the
blocked_on value has long been a source of bugs in the development of
the proxy series, as the task might have been proxy-migrated to a cpu
where it can't run. That's why my mental rules tend towards doing the
clearing in a few places and setting PROXY_WAKING in most cases (so
we're sure to evaluate the task before letting it run). My earlier
logic of keeping blocked_on_state separate from blocked_on was trying
to make these rules as obvious as possible, and since consolidating
them I still get mentally muddied at times - ie, we probably don't
need to be clearing blocked_on in the mutex lock paths anymore, but
the symmetry is a little helpful to me.
But the fact that you're clearing the state on prev here, and at that
point prev is current saves it, since current can obviously run on
this cpu. So probably just needs a comment to that effect.
* sched/core: Handle "blocked_on" clearing for wakeups in ttwu_runnable()
Mostly looks sane to me (though I still have some heistancy to
dropping the set_task_blocked_on_waking() bit)
* sched/core: Remove "p->wake_cpu" constraint in proxy_needs_return()
Yeah, that's a sound call, the shortcut isn't necessary and just adds
complexity.
* sched/core: Allow callers of try_to_block_task() to handle
"blocked_on" relation
Seems like it could be pulled up earlier in the series? (with your first change)
* sched/core: Prepare proxy_deactivate() to comply with ttwu state machinery
This one I've not totally gotten my head around, still. The
"WRITE_ONCE(p->__state, TASK_RUNNING);" in find_proxy_task() feels
wrong, as it looks like we're overriding what ttwu should be handling.
But again, this is only done on current, so it's probably ok.
Similarly the clear_task_blocked_on() in proxy_deactivate() doesn't
make it clear how we ensure we're not proxy-migrated, and the
clear_task_blocked_on() in __block_task() feels wrong to me, as I
think we will need that for sleeping owner enqueuing.
But again, this didn't crash (at least right away), so it may just be
I've not fit it into my mental model yet and I'll get it eventually.
* sched/core: Remove proxy_task_runnable_but_waking()
Looks lovely, but obviously depends on the previous changes.
* sched/core: Simplify proxy_force_return()
Again, I really like how much that simplifies the logic! But I'm
hesitant as my previous attempts to do similar didn't work, and it
seems it depends on the ttwu state machinery change I've not fully
understood.
* sched/core: Reset the donor to current task when donor is woken
Looks nice! I fret there may be some subtlety I'm missing, but once I
get some confidence in it, I'll be happy to have it.
Anyway, apologies I've not had more time to spend on your feedback
yet. I was hoping to start integrating and folding in your proposed
changes for another revision (if you are ok with that - I can keep
them separate as well, but it feels like more churn for reviewers),
but with Peter sounding like he's in-progress on queueing the current
set (with modifications), I want to wait to see if we should just work
this out on top of what he has (which I'm fine with).
As always, many many thanks for your time and feedback here! I really
appreciate your contributions to this effort!
-john
* Re: [PATCH v26 00/10] Simple Donor Migration for Proxy Execution
2026-03-27 19:10 ` John Stultz
@ 2026-03-28 4:53 ` K Prateek Nayak
0 siblings, 0 replies; 56+ messages in thread
From: K Prateek Nayak @ 2026-03-28 4:53 UTC (permalink / raw)
To: John Stultz
Cc: LKML, Joel Fernandes, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, Dietmar Eggemann, Valentin Schneider,
Steven Rostedt, Ben Segall, Zimuzo Ezeozue, Mel Gorman,
Will Deacon, Waiman Long, Boqun Feng, Paul E. McKenney,
Metin Kaya, Xuewen Yan, Thomas Gleixner, Daniel Lezcano,
Suleiman Souhlal, kuyo chang, hupu, kernel-team
Hello John,
On 3/28/2026 12:40 AM, John Stultz wrote:
> On Wed, Mar 25, 2026 at 3:52 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>> On 3/25/2026 12:43 AM, John Stultz wrote:
>>> There’s also been some further improvements In the full Proxy
>>> Execution series:
>>> * Tweaks to proxy_needs_return() suggested by K Prateek
>>
>> To answer your question on v25, I finally seem to have
>> ttwu_state_match() happy with the pieces in:
>> https://github.com/kudureranganath/linux/commits/kudure/sched/proxy/ttwu_state_match/
>>
>> The base rationale is still the same from
>> https://lore.kernel.org/lkml/eccf9bb5-8455-48e5-aa35-4878c25f6822@amd.com/
>
> So thank you so much for sharing this tree! It's definitely helpful
> and better shows how to split up the larger proposal you had.
>
> I've been occupied chasing the null __pick_eevdf() return issue (which
> I've now tripped without my proxy changes, so its an upstream thing
> but I'd still like to bisect it down),
Is the __pick_eevdf() returning NULL on tip:sched/core or is this on
mainline?
_Insert GTA San Andreas "Here we go again." meme here_
> along with other items, so I've
> not yet been able to fully ingest your changes. I did run some testing
> on them and didn't see any immediate issues (other then the null
> __pick_eevdf() issue, which limits the testing time to ~4 hours), and
> I even ran it along with the sleeping owner enqueuing change on top
> which had been giving me grief in earlier attempts to integrate these
> suggestions. So that's good!
>
> My initial/brief reactions looking through your the series:
>
> * sched/core: Clear "blocked_on" relation if schedule races with wakeup
>
> At first glance, this makes me feel nervous because clearing the
> blocked_on value has long been a source of bugs in the development of
> the proxy series, as the task might have been proxy-migrated to a cpu
> where it can't run. That's why my mental rules tend towards doing the
> clearing in a few places and setting PROXY_WAKING in most cases (so
> we're sure to evaluate the task before letting it run). My earlier
> logic of keeping blocked_on_state separate from blocked_on was trying
> to make these rules as obvious as possible, and since consolidating
> them I still get mentally muddied at times - i.e., we probably don't
> need to be clearing blocked_on in the mutex lock paths anymore, but
> the symmetry is a little helpful to me.
>
> But the fact that you're clearing the state on prev here saves it,
> since at that point prev is current, and current can obviously run on
> this cpu. So it probably just needs a comment to that effect.
Ack!
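Something along these lines, perhaps (the wording and exact placement
are only a first guess at what you're asking for):

```c
/*
 * prev is current here: it is already running on this CPU, so it
 * cannot have been proxy-migrated somewhere it is not allowed to
 * run. Clearing blocked_on directly (rather than going through
 * the PROXY_WAKING evaluation) is therefore safe.
 */
__clear_task_blocked_on(prev, NULL);
```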
>
> * sched/core: Handle "blocked_on" clearing for wakeups in ttwu_runnable()
>
> Mostly looks sane to me (though I still have some hesitancy about
> dropping the set_task_blocked_on_waking() bit)
>
> * sched/core: Remove "p->wake_cpu" constraint in proxy_needs_return()
>
> Yeah, that's a sound call, the shortcut isn't necessary and just adds
> complexity.
>
> * sched/core: Allow callers of try_to_block_task() to handle
> "blocked_on" relation
>
> Seems like it could be pulled up earlier in the series? (with your first change)
>
> * sched/core: Prepare proxy_deactivate() to comply with ttwu state machinery
>
> This one I've not totally gotten my head around, still. The
> "WRITE_ONCE(p->__state, TASK_RUNNING);" in find_proxy_task() feels
> wrong, as it looks like we're overriding what ttwu should be handling.
So the reason for that is, we can have:
CPU0 (owner - A)                      CPU1 (donor - B)
================                      ================
mutex_unlock(M)
  atomic_long_try_cmpxchg_release()
                                      /* B is just trying to acquire the mutex. */
  ...                                 schedule() /* prev = B, next = B; B is blocked on A */
                                        find_proxy_task()
                                          ...
                                          owner = __mutex_owner(M);
                                          if (!owner && task_current(rq, B))
                                            __clear_task_blocked_on(p, NULL)
                                          return B
  __set_task_blocked_on_waking(B, M); ...
  /* nop since !B->blocked_on */      /* B starts running without TASK_RUNNING. */
                                      /*
                                       * Scenario 1 - B gets mutex and then sets
                                       * TASK_RUNNING on its own.
                                       */
  /* Scenario 2 */
  wake_q_add(B)
wake_up_process()
  ttwu_state_match() /* true */
  B->__state = TASK_RUNNING;
So in either case, the task will wake up and set TASK_RUNNING, so we
can just do the pending bits of the wakeup in __schedule(). I think
even without an explicit TASK_RUNNING it should be fine, but I need
to jog my memory on why I added that (maybe out of caution).
If the task fails to acquire mutex, it'll reset to blocked
state and go into schedule() and everything should just work
out fine.
> But again, this is only done on current, so it's probably ok.
> Similarly the clear_task_blocked_on() in proxy_deactivate() doesn't
> make it clear how we ensure we're not proxy-migrated,
So the rationale behind that was, we should *never* hit that
condition, but if we do, perhaps we can simply do a move_queued_task()
back to "wake_cpu" to ensure correctness?
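Roughly something like this (completely untested; the exact call
site in proxy_deactivate() and the rq-lock/rq_flags context are
only assumed):

```c
/*
 * Hypothetical fallback: we should never be off wake_cpu here,
 * but if we somehow are, migrate back to be safe.
 */
if (WARN_ON_ONCE(task_cpu(p) != p->wake_cpu))
	rq = move_queued_task(rq, &rf, p, p->wake_cpu);
```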
> and the
> clear_task_blocked_on() in __block_task() feels wrong to me, as I
> think we will need that for sleeping owner enqueuing.
Yes, for the sleeping owner case that is not the ideal place - I
completely agree with that. Let me go stare at it and find a better
place to put it.
>
> But again, this didn't crash (at least right away), so it may just be
> I've not fit it into my mental model yet and I'll get it eventually.
Yeah, but then you lose the "blocked_on" chain when you deactivate
the donors, only for it to be reconstructed by running the task
for a little bit and re-establishing that relation. So although
it might not have crashed (yet!), it is pretty inefficient.
I'll go stare more at that.
>
> * sched/core: Remove proxy_task_runnable_but_waking()
>
> Looks lovely, but obviously depends on the previous changes.
>
> * sched/core: Simplify proxy_force_return()
>
> Again, I really like how much that simplifies the logic! But I'm
> hesitant as my previous attempts to do similar didn't work, and it
> seems it depends on the ttwu state machinery change I've not fully
> understood.
Highly intertwined indeed! I'll try to add more comments and improve
the commit messages.
>
> * sched/core: Reset the donor to current task when donor is woken
>
> Looks nice! I fret there may be some subtlety I'm missing, but once I
> get some confidence in it, I'll be happy to have it.
Ack! I too will keep testing. Btw, do you have something that stresses
the deadline bits? I can't seem to reliably get something running with
a lot of preemptions when holding mutexes.
>
> Anyway, apologies I've not had more time to spend on your feedback
> yet. I was hoping to start integrating and folding in your proposed
> changes for another revision (if you are ok with that - I can keep
> them separate as well, but it feels like more churn for reviewers),
> but with Peter sounding like he's in-progress on queueing the current
> set (with modifications), I want to wait to see if we should just work
> this out on top of what he has (which I'm fine with).
Ack! None of this is strictly necessary until we get to ttwu handling
the return migration, so it should be okay. If you are occupied, I can
test and send these changes on top separately to ease some load.
>
> As always, many many thanks for your time and feedback here! I really
> appreciate your contributions to this effort!
And thanks a ton for looking at the tree. Much appreciated _/\_
--
Thanks and Regards,
Prateek